A recurring question around Open XML is how to search and replace text in a word-processing document. There have been several attempts at presenting example code to do this; however, until now I have not seen any examples that implement it correctly. This post presents example code that implements a correct algorithm to search and replace text.
The first challenge is to handle the case where the text you are searching for spans runs with different formatting. A simple example demonstrates the problem. You want to replace ‘Hello World’ with ‘Hi World’. If, in the document, the word ‘World’ is bolded, then the markup will look something like this:
<w:p> <w:r> <w:t xml:space="preserve">Hello </w:t> </w:r> <w:r> <w:rPr> <w:b /> </w:rPr> <w:t>World</w:t> </w:r> </w:p>
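To make the problem concrete, here is a minimal Python sketch (standard library only, and not the code from this post) that loads the markup above and checks where the search text can be found. The helper name `run_texts` is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R, T = f"{{{W}}}r", f"{{{W}}}t"

# The paragraph from the example above, with the namespace declared.
paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>World</w:t></w:r>'
    '</w:p>'
)


def run_texts(p):
    """Return the text of each run, in document order (hypothetical helper)."""
    return ["".join(t.text or "" for t in run.iter(T)) for run in p.iter(R)]


texts = run_texts(paragraph)
# Searching run by run misses the phrase, because no single run contains it.
found_in_single_run = any("Hello World" in text for text in texts)
# Searching the concatenated paragraph text finds it.
found_in_paragraph = "Hello World" in "".join(texts)
```

A search that looks at one `w:t` element at a time never sees ‘Hello World’, which is why the algorithm has to work across run boundaries.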
Even though the search text spans runs, the algorithm should find the text and replace it. The next challenge is to define exactly the semantics of searching and replacing text if the text that you are searching for spans runs with different formatting. In short, the replaced text takes on the run formatting of the run that contains the first character of the search string. An example makes this clear. In the following sentence, the first four characters of the word ‘include’ are bolded:
On the Insert tab, the galleries include items.
If you replace ‘include’ with ‘do not include’, then the sentence should be formatted like this:
On the Insert tab, the galleries do not include items.
The replaced text takes on the formatting of the ‘i’ character of include, which was bolded.
Here is a short screen-cast that walks through the algorithm and the code.
Search and Replace Algorithm
It certainly would be possible to carefully define an algorithm to search for text that spans runs, noting where the searched text intersects bookmarks, comments, and the like. However, this algorithm would be pretty complicated, and to be done properly, a test team would need to write extensive test specs, and supply a plethora of sample documents that exercise all edge cases. It is a non-trivial project.
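However, there is another approach that we can take that is pretty simple, easy to test, and yields the correct results in all cases. The algorithm consists of three steps: split every run into multiple runs of a single character each, find each series of single-character runs that matches the search string and replace it with one new run containing the replacement text, and then coalesce adjacent runs that have identical formatting.
It will be helpful to walk through an example, and examine the markup at each step in the process. The following paragraph contains the text, “See this markup.” The letters ‘th’ in the word ‘this’ are bolded. We want to change the word ‘this’ to the word ‘the’.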
<w:p> <w:r> <w:t xml:space="preserve">See </w:t> </w:r> <w:r> <w:rPr> <w:b /> </w:rPr> <w:t>th</w:t> </w:r> <w:r> <w:t>is markup.</w:t> </w:r> </w:p>
After splitting all runs into multiple runs of a single character each, the markup looks like this:
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"> <w:r> <w:t>S</w:t> </w:r> <w:r> <w:t>e</w:t> </w:r> <w:r> <w:t>e</w:t> </w:r> <w:r> <w:t xml:space="preserve"> </w:t> </w:r> <w:r> <w:rPr> <w:b /> </w:rPr> <w:t>t</w:t> </w:r> <w:r> <w:rPr> <w:b /> </w:rPr> <w:t>h</w:t> </w:r> <w:r> <w:t>i</w:t> </w:r> <w:r> <w:t>s</w:t> </w:r> <w:r> <w:t xml:space="preserve"> </w:t> </w:r> <w:r> <w:t>m</w:t> </w:r> <w:r> <w:t>a</w:t> </w:r> <w:r> <w:t>r</w:t> </w:r> <w:r> <w:t>k</w:t> </w:r> <w:r> <w:t>u</w:t> </w:r> <w:r> <w:t>p</w:t> </w:r> <w:r> <w:t>.</w:t> </w:r> </w:p>
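The splitting step can be sketched in a few lines of Python. This is an illustrative sketch using only the standard library, not the code from this post; the function name `split_runs` is hypothetical, and it assumes that only runs made up of `w:rPr` and `w:t` elements need splitting.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R, RPR, T = f"{{{W}}}r", f"{{{W}}}rPr", f"{{{W}}}t"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def split_runs(paragraph):
    """Split each plain text run into one run per character (hypothetical helper)."""
    new_children = []
    for child in list(paragraph):
        content = [c for c in child if c.tag != RPR]
        # Simplification: only runs whose content is w:t elements are split;
        # anything else (tabs, breaks, non-run children) is kept untouched.
        if child.tag != R or not content or any(c.tag != T for c in content):
            new_children.append(child)
            continue
        rpr = child.find(RPR)
        for ch in "".join(t.text or "" for t in content):
            run = ET.Element(R)
            if rpr is not None:
                run.append(copy.deepcopy(rpr))  # every new run keeps the formatting
            t = ET.SubElement(run, T)
            t.text = ch
            if ch.isspace():
                t.set(XML_SPACE, "preserve")  # keep whitespace-only text significant
            new_children.append(run)
    paragraph[:] = new_children


paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t xml:space="preserve">See </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>th</w:t></w:r>'
    '<w:r><w:t>is markup.</w:t></w:r>'
    '</w:p>'
)
split_runs(paragraph)
chars = [run.findtext(T) for run in paragraph]
bold = [run.find(f"{RPR}/{{{W}}}b") is not None for run in paragraph]
```

After the call, `chars` holds one character per run and `bold` is true only for the runs holding ‘t’ and ‘h’, which mirrors the markup shown above.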
The algorithm can then iterate through the runs, finding the series of runs where the text of the runs matches ‘t’, ‘h’, ‘i’, ‘s’. The algorithm then inserts a new run containing the replacement text, taking the run properties from the run that contained the ‘t’ in the search string, which indicate that the run is bolded. It also removes the single-character runs that match the search string. The adjusted markup looks like this:
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"> <w:r> <w:t>S</w:t> </w:r> <w:r> <w:t>e</w:t> </w:r> <w:r> <w:t>e</w:t> </w:r> <w:r> <w:t xml:space="preserve"> </w:t> </w:r> <w:r> <w:rPr> <w:b /> </w:rPr> <w:t>the</w:t> </w:r> <w:r> <w:t xml:space="preserve"> </w:t> </w:r> <w:r> <w:t>m</w:t> </w:r> <w:r> <w:t>a</w:t> </w:r> <w:r> <w:t>r</w:t> </w:r> <w:r> <w:t>k</w:t> </w:r> <w:r> <w:t>u</w:t> </w:r> <w:r> <w:t>p</w:t> </w:r> <w:r> <w:t>.</w:t> </w:r> </w:p>
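The match-and-replace step over single-character runs might look like the following Python sketch. It is not the code from this post: the names `run_char`, `replace_in_split_runs`, and `make_run` are hypothetical, and the sketch assumes the paragraph has already been split into single-character runs.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P, R, RPR, T, B = (f"{{{W}}}{name}" for name in ("p", "r", "rPr", "t", "b"))
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def run_char(node):
    """Return the single character of a one-character run, else None."""
    text = node.findtext(T) if node.tag == R else None
    return text if text is not None and len(text) == 1 else None


def replace_in_split_runs(paragraph, search, replace):
    """Replace each matching series of single-character runs; return the count."""
    if not search:
        return 0
    children = list(paragraph)
    chars = [run_char(c) for c in children]
    result, i, count = [], 0, 0
    while i < len(children):
        if chars[i:i + len(search)] == list(search):
            if replace:  # guard: indexing replace[0] on an empty string would fail
                new_run = ET.Element(R)
                rpr = children[i].find(RPR)  # formatting of the first matched character
                if rpr is not None:
                    new_run.append(copy.deepcopy(rpr))
                t = ET.SubElement(new_run, T)
                t.text = replace
                if replace[0] == " " or replace[-1] == " ":
                    t.set(XML_SPACE, "preserve")
                result.append(new_run)
            i += len(search)  # drop the matched single-character runs
            count += 1
        else:
            result.append(children[i])
            i += 1
    paragraph[:] = result
    return count


def make_run(ch, bold):
    """Build a one-character run for the demo below (hypothetical helper)."""
    run = ET.Element(R)
    if bold:
        ET.SubElement(ET.SubElement(run, RPR), B)
    ET.SubElement(run, T).text = ch
    return run


paragraph = ET.Element(P)
for index, ch in enumerate("See this markup."):
    paragraph.append(make_run(ch, bold=index in (4, 5)))  # 't' and 'h' are bold
count = replace_in_split_runs(paragraph, "this", "the")
texts = [run.findtext(T) for run in paragraph]
```

The new run copies its run properties from the run holding the first matched character, so ‘the’ comes out bold here even though ‘is’ was not.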
Finally, the algorithm iterates through the runs, coalescing adjacent runs with identical formatting.
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"> <w:r> <w:t xml:space="preserve">See </w:t> </w:r> <w:r> <w:rPr> <w:b /> </w:rPr> <w:t>the</w:t> </w:r> <w:r> <w:t xml:space="preserve"> markup.</w:t> </w:r> </w:p>
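The coalescing step can be sketched as follows. Again this is an illustrative Python sketch rather than the code from this post; `is_text_run`, `format_key`, and `coalesce_runs` are hypothetical names, and comparing serialized `w:rPr` elements is an assumption about what "identical formatting" means (it is conservative: it merges only when the properties serialize identically).

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P, R, RPR, T, B = (f"{{{W}}}{name}" for name in ("p", "r", "rPr", "t", "b"))
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def is_text_run(node):
    """True for a run that holds exactly one w:t besides its optional w:rPr."""
    return node.tag == R and [c.tag for c in node if c.tag != RPR] == [T]


def format_key(run):
    """Serialize the run properties so that two runs can be compared."""
    rpr = run.find(RPR)
    if rpr is None:
        return b""
    clone = copy.deepcopy(rpr)
    clone.tail = None  # ignore whitespace that follows the element
    return ET.tostring(clone)


def coalesce_runs(paragraph):
    """Merge adjacent text runs whose run properties are identical."""
    result = []
    for child in list(paragraph):
        prev = result[-1] if result else None
        if (prev is not None and is_text_run(prev) and is_text_run(child)
                and format_key(prev) == format_key(child)):
            t = prev.find(T)
            t.text = (t.text or "") + (child.findtext(T) or "")
            if t.text != t.text.strip():
                t.set(XML_SPACE, "preserve")  # leading or trailing whitespace
        else:
            result.append(child)
    paragraph[:] = result


def make_run(text, bold=False):
    """Build a run for the demo below (hypothetical helper)."""
    run = ET.Element(R)
    if bold:
        ET.SubElement(ET.SubElement(run, RPR), B)
    ET.SubElement(run, T).text = text
    return run


paragraph = ET.Element(P)
for ch in "See ":
    paragraph.append(make_run(ch))
paragraph.append(make_run("the", bold=True))
for ch in " markup.":
    paragraph.append(make_run(ch))
coalesce_runs(paragraph)
texts = [run.findtext(T) for run in paragraph]
spaces = [run.find(T).get(XML_SPACE) for run in paragraph]
```

Starting from the 13 runs produced by the replacement step, this leaves three runs, matching the final markup above.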
There are a few additional notes worth mentioning about this algorithm.
Hi again Eric, I have one other question about the search and replace code. I have found that it does not work if the text to be replaced is inside a Text Box. I have tried stepping through the code to work out why but have been unsuccessful. Do you know why this would be?
Hi Mort, You are right - the algorithm neglected to process descendant paragraphs of a paragraph. I've updated the code. It now works properly.
-Eric
Thanks again Eric. It works for me now. This SearchAndReplace class has been invaluable to a project I'm working on. I'm in the transition from Lotus Notes developer to SharePoint developer and resources like this website have made things a lot easier.
Hi Eric,
I've been having a play with the code sample, and noticed that it doesn't seem to correctly replace instances of the search string in a document after they're used in a text box. So if, for example, I have the following document:
hello world
[text box] hello world [/text box]
...and am replacing the string 'hello' with 'hi', the result I get is...
hi world
[text box] hi world [/text box]
I'm pretty sure this is not the intended behaviour! Otherwise it's a very helpful article, thanks :)
-Zaxian
Hi Zaxian,
I am not seeing the symptoms you mention, but I am sure that it is due to me not creating the document properly. Would you kindly send me a sample document where text is not being replaced properly? You can either post it here on the forum, or send it as a message on OpenXMLDeveloper.org, or send to my email eric at ericwhite.com.
Thanks
First of all, thank you very much for providing the very useful knowledge and code for office automation using Open XML.
I am Sanjeev, and I have just started using Open XML for editing Office documents. I found your very useful article and code to search and replace the text in Word files. It is very good and very useful for completing my tasks. Thank you for that.
I am using the search and replace functionality by placing some placeholders in the documents. I succeeded in replacing the placeholders, but the one thing I am not able to do is replace the placeholders with text that has multiple lines.
I tried adding a "w:br" node in the XmlElement, but it still does not put the text on the next line.
Am I missing something? Please help.
Thanks in advance.
Thanks and Regards,
Sanjeev
Hi Eric
First of all, thanks for the search and replace code, but the problem is that I cannot make it work for my example.
Here's the sample document.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="schemas.openxmlformats.org/.../2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="schemas.openxmlformats.org/.../relationships" xmlns:m="schemas.openxmlformats.org/.../math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="schemas.openxmlformats.org/.../wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="schemas.openxmlformats.org/.../main" xmlns:wne="schemas.microsoft.com/.../wordml"><w:body><w:p w:rsidR="009001B7" w:rsidRDefault="009001B7" w:rsidP="009001B7"><w:pPr><w:pStyle w:val="Heading1"/><w:rPr><w:lang w:val="en-CA"/></w:rPr></w:pPr><w:r><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t>The first testing form</w:t></w:r></w:p><w:p w:rsidR="009001B7" w:rsidRDefault="009001B7" w:rsidP="009001B7"><w:pPr><w:rPr><w:lang w:val="en-CA"/></w:rPr></w:pPr></w:p><w:p w:rsidR="005404F9" w:rsidRDefault="009001B7" w:rsidP="009001B7"><w:pPr><w:rPr><w:lang w:val="en-CA"/></w:rPr></w:pPr><w:r><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t>This is a test</w:t></w:r><w:r w:rsidR="005404F9"><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t>ing document for data field1</w:t></w:r><w:proofErr w:type="gramStart"/><w:r w:rsidR="005404F9"><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t>: !!</w:t></w:r><w:proofErr w:type="gramEnd"/><w:r><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t xml:space="preserve">Plan </w:t></w:r><w:proofErr w:type="spellStart"/><w:r><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t>Member.First</w:t></w:r><w:proofErr w:type="spellEnd"/><w:r><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t xml:space="preserve"> Name</w:t></w:r><w:r w:rsidR="005404F9"><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t>!!</w:t></w:r></w:p><w:p w:rsidR="009001B7" w:rsidRDefault="005404F9" w:rsidP="009001B7"><w:pPr><w:rPr><w:lang w:val="en-CA"/></w:rPr></w:pPr><w:r><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t xml:space="preserve"> </w:t></w:r><w:proofErr w:type="spellStart"/><w:r w:rsidR="009001B7"><w:rPr><w:lang 
w:val="en-CA"/></w:rPr><w:t>Datafile</w:t></w:r><w:proofErr w:type="spellEnd"/><w:r w:rsidR="009001B7"><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t xml:space="preserve"> 2</w:t></w:r><w:proofErr w:type="gramStart"/><w:r w:rsidR="009001B7"><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t xml:space="preserve">: </w:t></w:r><w:r><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t>!!</w:t></w:r><w:proofErr w:type="gramEnd"/><w:r w:rsidR="009001B7"><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t xml:space="preserve">Plan </w:t></w:r><w:proofErr w:type="spellStart"/><w:r w:rsidR="009001B7"><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t>Member.Last</w:t></w:r><w:proofErr w:type="spellEnd"/><w:r w:rsidR="009001B7"><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t xml:space="preserve"> Name</w:t></w:r><w:r><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t>!!</w:t></w:r></w:p><w:p w:rsidR="009001B7" w:rsidRDefault="009001B7" w:rsidP="009001B7"><w:pPr><w:rPr><w:lang w:val="en-CA"/></w:rPr></w:pPr></w:p><w:p w:rsidR="009001B7" w:rsidRDefault="009001B7" w:rsidP="009001B7"><w:pPr><w:rPr><w:lang w:val="en-CA"/></w:rPr></w:pPr><w:r><w:rPr><w:lang w:val="en-CA"/></w:rPr><w:t>End of testing Document.</w:t></w:r></w:p><w:p w:rsidR="009001B7" w:rsidRDefault="009001B7" w:rsidP="009001B7"><w:pPr><w:rPr><w:lang w:val="en-CA"/></w:rPr></w:pPr></w:p><w:p w:rsidR="00924DD5" w:rsidRDefault="00924DD5"/><w:sectPr w:rsidR="00924DD5" w:rsidSect="00924DD5"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>
SearchAndReplacer.SearchAndReplace(TargetForm, "!!Plan Member.First Name!!","Frank");
Can you please let me know what the problem is?
Regards,
Tony
Sanjeev, I encountered a similar problem trying to insert a <w:br /> in between lines of text. My approach was not as elegant as yours since I was just adding the "<w:br />" string at the end of the text as I built the multiple lines. Through my debugging, I found that the SearchAndReplace was converting the "<" and ">" to "&lt;" and "&gt;" and I couldn't figure out how to prevent it. What I ended up doing was using a Regex.Replace to put back the "<" and ">" characters. Probably not the best way of doing things, but I had to get it working. Perhaps Eric or someone else here can provide a better solution.
Hi Eric, can you please point out the differences between this version of the code and the one that ignored the text boxes? I need to ignore text boxes in searches.
Michael
Thanks for the article. I have successfully used your code in my app.
I have come across a bug when replacing the text with an empty string.
At line 149, "if (replace[0] == ' ' || replace[replace.Length - 1] == ' ')", an index out of bounds exception is raised.
I fixed this by wrapping the above inside if (replace.Length > 0) {...}
Mike Williams
Correction to earlier post
I have now changed the wrapping code to if (!string.IsNullOrEmpty(replace)) {...} to handle null strings as well as empty strings.
Regards