
Cutting WordML Documents into Individual Pages.

Last post 02-12-2009, 6:24 PM by cauld. 5 replies.
  •  01-30-2009, 6:16 PM 4017

    Cutting WordML Documents into Individual Pages.

    OK, as I understand it, WordML documents don't really have "page breaks" as one would think. Because font sizes, pictures, and the like can change, the "page breaks" are relative to where they fell when Word last paginated the document.

    That said, I am looking to divide Word documents by page. I have been looking over the Open XML SDK v2.0 (and loving the amount of detail in the CHM ...), but I have not been able to find any object, collection, or property that can help me divide my .docx documents by page.

    I am aware that if you look at the OpenXML for a WordML document you will see things like:
    <w:lastRenderedPageBreak />    or, it would seem,    <w:br w:type="page" />

    with lastRenderedPageBreak being a relative (soft) break put in by Word on its last pagination, and type="page" being a hard break inserted by a user.
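
    For anyone following along, here is a minimal sketch of splitting on those two markers. It is written in Python with only the standard library rather than the Open XML SDK, and the function name split_text_by_markers and the SAMPLE document are made up for illustration. It walks the elements in document order and starts a new "page" at every marker. Word can also write a lastRenderedPageBreak right after a hard break, so a real splitter would probably need to de-duplicate; this sketch does not.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def split_text_by_markers(document_xml):
    """Group run text into pages, starting a new page at each break marker."""
    root = ET.fromstring(document_xml)
    pages = [[]]
    for node in root.iter():  # document order
        is_soft = node.tag == f"{{{W}}}lastRenderedPageBreak"
        is_hard = (node.tag == f"{{{W}}}br"
                   and node.get(f"{{{W}}}type") == "page")
        if is_soft or is_hard:
            pages.append([])  # everything after the marker is a new page
        elif node.tag == f"{{{W}}}t" and node.text:
            pages[-1].append(node.text)
    return [" ".join(chunks) for chunks in pages]


# Hypothetical three-page document: one hard break, one soft break.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
  <w:p><w:r><w:t>Page one.</w:t></w:r></w:p>
  <w:p><w:r><w:br w:type="page"/><w:t>Page two.</w:t></w:r></w:p>
  <w:p><w:r><w:lastRenderedPageBreak/><w:t>Page three.</w:t></w:r></w:p>
</w:body></w:document>"""

if __name__ == "__main__":
    for number, text in enumerate(split_text_by_markers(SAMPLE), start=1):
        print(number, text)
```

    This only collects text; it does not rebuild valid .docx packages per page, which is the harder part of the problem.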

    Now, I am pretty sure I could spend a good deal of time creating some sort of XML-parsing monster to hack and slash the raw OpenXML and reassemble it into smaller chunks. But I do not relish doing that much legwork if something already exists that does this.

    Is there some simple(ish) way of doing this? I have looked through the Open XML SDK v2.0 (to no avail). I then looked at System.IO.Packaging but was unable to find anything more helpful for this particular task.


    Any hints, tips, or tactics for this would be most helpful.

    Thanks!
  •  02-03-2009, 12:55 PM 4025 in reply to 4017

    Re: Cutting WordML Documents into Individual Pages.

    Well, I had what I thought was a working solution. I was looping through the document using the <w:lastRenderedPageBreak /> elements. This was going fine until I realized that it breaks down if you simply hit [Enter] repeatedly until the text runs over onto the next page.

    I.e. type, type, Enter, Enter {go over the page break}, Enter, Enter, type, type.

    In that scenario you get absolutely no indication in the XML that there is a page break there.

    **DOCUMENT.XML**
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A">
      <w:r>
        <w:t xml:space="preserve">All my fun TEXT.</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="0061403F" w:rsidRDefault="0061403F" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />   <-{page break}
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A">
      <w:r>
        <w:t xml:space="preserve">All my fun TEXT.</w:t>
      </w:r>
    </w:p>


            Notice the distinct lack of any indication of a page break.

            Still looking for suggestions on this.

    Thanks!
  •  02-08-2009, 4:09 PM 4040 in reply to 4025

    Re: Cutting WordML Documents into Individual Pages.

    Ralph,

    So the question comes down to what is meant by a page break.
    Obviously there are explicit breaks and you've worked out how to split by those already.
    The scenario you talk about is an implicit break, i.e. where the break appears really comes down to the client that is displaying the content. As an example of this, try looking at the document in the Word 2007 reading view and you'll find that the implicit page breaks happen in different places.

    You need to really think of the document as a continuous document and work with the explicit page breaks only.

    HTH
    Chris
  •  02-12-2009, 1:28 PM 4071 in reply to 4040

    Re: Cutting WordML Documents into Individual Pages.

    Chris,

        Thanks for the reply.

        This isn't really a good option. Whenever people interact with documents in any way, they have a very strong concept of "page", e.g. "What page is that on?", "Which page are you looking at?", etc.

        In the same manner, I need to have a concept of page for these documents. Obviously anything that I cook up won't be exact, and it does not need to be. If you get people close, they are generally good to go.

        My current strategy is to run through each paragraph, find its formatting, and determine how many pixels tall and wide the text (or object) is, keeping a running count to track where I am on the page. For example, long strings have to wrap onto multiple lines, pictures take up multiple lines, etc.
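
        A rough sketch of that running-count idea, in Python, might look like the following. Every number in it is an assumption chosen for illustration: estimate_pages is a hypothetical helper, the 240-twip line height and 90 characters per line are guesses rather than values read from the document's formatting, and it ignores pictures, tables, and paragraphs taller than a page. It only shows the bookkeeping: count wrapped lines per paragraph and start a new page when the next paragraph no longer fits.

```python
import math


def estimate_pages(paragraphs, page_height_twips,
                   line_height_twips=240, chars_per_line=90):
    """Return an estimated page number for each paragraph's text.

    line_height_twips and chars_per_line are illustrative assumptions,
    not values taken from any real document's styles.
    """
    lines_per_page = page_height_twips // line_height_twips
    used_lines = 0
    page = 1
    result = []
    for text in paragraphs:
        # An empty paragraph still occupies one line; long text wraps.
        lines = max(1, math.ceil(len(text) / chars_per_line))
        if used_lines > 0 and used_lines + lines > lines_per_page:
            page += 1          # paragraph does not fit, start a new page
            used_lines = 0
        result.append(page)
        used_lines += lines
    return result


if __name__ == "__main__":
    # The "hit Enter until you spill over" scenario: text, 60 empty
    # paragraphs, then text again, on a page 14400 twips tall.
    paragraphs = ["All my fun TEXT."] + [""] * 60 + ["All my fun TEXT."]
    pages = estimate_pages(paragraphs, page_height_twips=14400)
    print(pages[0], pages[-1])
```

        With these made-up numbers a page holds 60 lines, so the second block of text lands on the estimated page 2 even though the XML contains no break marker.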

        I am interested to see how this works. From a general overview it seems like it would be a very slow process, but it is just XML manipulation, so it probably won't be.

    Wish me luck and, as always, comments, tips, and tactics are more than welcome.

    Thanks,
    Ralph
  •  02-12-2009, 5:01 PM 4072 in reply to 4071

    Re: Cutting WordML Documents into Individual Pages.

    Baby steps it would seem.

    Along those lines, I am trying to do said estimation and have hit a bit of a roadblock.

    When attempting to get the page size I am grabbing the following section from the /word/document.xml file:

    <w:sectPr w:rsidR="006D7F8D" w:rsidRPr="006D7F8D" w:rsidSect="006D7F8D">
      <w:pgSz w:w="14400" w:h="14400" w:orient="landscape" w:code="1" />
      <w:pgMar w:top="0" w:right="0" w:bottom="0" w:left="0" w:header="720"
            w:footer="720" w:gutter="0" />
      <w:cols w:space="720" />
      <w:docGrid w:linePitch="360" />
    </w:sectPr>

    My issue is that I am unsure what these numbers mean, mainly what units they are in. I looked through the Open XML SDK documentation, but it was not terribly enlightening about the units or makeup of these numbers, e.g. the 14400.

    I know the document is 10in x 10in, so I could work out a ratio, but it would probably be better to understand these numbers as opposed to hacking them.

    Thanks,
    Ralph

  •  02-12-2009, 6:24 PM 4073 in reply to 4072

    Re: Cutting WordML Documents into Individual Pages.

    Hi there Ralph.

    I alluded above to the implicit page model in OOXML (contrast this with say printing formats like PDF and XPS). This is obviously because different apps may want to render their breaks at different places. Think Word on a desktop vs an iPhone...

    So you've obviously got one approach that you're starting to work with.
    To answer your specific question here:
    The numbers are in Twips.
    See
    http://openiso.org/Ecma/376/Part4/2.6.13
    and
    http://openiso.org/Ecma/376/Part4/2.18.105

    A Twip is equivalent to 1/1440th of an inch.
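
    So 14400 twips / 1440 = 10 inches, which matches the 10in x 10in page you describe. As a small illustration (Python standard library only; page_size_inches is a hypothetical helper name, and the sample is the sectPr from the post above with a namespace declaration added so it parses on its own):

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TWIPS_PER_INCH = 1440  # 1 twip = 1/1440 inch


def page_size_inches(sect_pr_xml):
    """Return (width, height) in inches from the first w:pgSz found."""
    root = ET.fromstring(sect_pr_xml)
    pg_sz = next(root.iter(f"{{{W}}}pgSz"), None)
    if pg_sz is None:
        raise ValueError("no w:pgSz element found")
    width = int(pg_sz.get(f"{{{W}}}w")) / TWIPS_PER_INCH
    height = int(pg_sz.get(f"{{{W}}}h")) / TWIPS_PER_INCH
    return width, height


SECT_PR = f"""<w:sectPr xmlns:w="{W}">
  <w:pgSz w:w="14400" w:h="14400" w:orient="landscape" w:code="1" />
  <w:pgMar w:top="0" w:right="0" w:bottom="0" w:left="0" w:header="720"
        w:footer="720" w:gutter="0" />
</w:sectPr>"""

if __name__ == "__main__":
    print(page_size_inches(SECT_PR))  # (10.0, 10.0)
```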

    Another option you might consider, if you want to create hard breaks that map to a specific application (e.g. Microsoft Word 2007), is to process the document in that tool and insert the breaks, e.g. use Word automation to insert breaks for you.
