
How to get images from docx file ( document.xml)

Last post 10-07-2008, 5:07 AM by beast. 8 replies.
  •  08-02-2008, 6:04 PM 3526

    How to get images from docx file ( document.xml)

    I have a docx file that contains images.

    Please help me get the images referenced from the [document.xml] file. I am using XSL to transform the docx file to HTML.

    The relevant part of the XSL code is written below.

    <xsl:for-each select="//w:sdt/w:sdtContent">

          <xsl:for-each select="w:p/w:r">

             <p><xsl:value-of select="w:t"/></p>

             <xsl:for-each select="w:p/w:r" >

                      <xsl:for-each select="w:drawing/a:graphic/a:graphicData/pic:pic">

                               <xsl:for-each select="pic:nvPicPr/pic:cNvPr">

                                     <img alt="" src="{$imgSrc}"/>

                               </xsl:for-each>

                      </xsl:for-each>

          </xsl:for-each>

    </xsl:for-each>

    <p><xsl:value-of select="w:r/w:t"/></p>

    </xsl:for-each>

    I am passing [imgSrc] in through <xsl:param name="imgSrc"/>.

    This code works for the text, but the image never appears in the output.

    Please help me out.

  •  08-04-2008, 9:14 AM 3528 in reply to 3526

    Re: How to get images from docx file ( document.xml)

    Hi,
    Image parts are not stored in document.xml.
    Unzip your document and look in /word/media.
    All your images are there.
    What you can get from document.xml is the relationship ID (rId) of the image you want:

    <pic:blipFill>
       <a:blip r:embed="rId4" />
    </pic:blipFill>

    With this rId you can find the target in /word/_rels/document.xml.rels:

    <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.jpeg" />

    And voilà, you have your image: the target is relative to /word, so here the file is /word/media/image1.jpeg.
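
    The lookup described above can be sketched in Java roughly as follows. This is a minimal illustration, not the only way to do it: the class name RelsLookup and the method readTargets are made up for this example, and the sample XML in main simply mirrors the Relationship element shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: maps relationship Ids (e.g. "rId4") to their Targets.
public class RelsLookup {
    // Namespace used by package relationship parts such as document.xml.rels.
    static final String RELS_NS =
        "http://schemas.openxmlformats.org/package/2006/relationships";

    /** Parses a .rels stream and returns a map of Id -> Target. */
    public static Map<String, String> readTargets(InputStream rels) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(rels);
        NodeList list = doc.getElementsByTagNameNS(RELS_NS, "Relationship");
        Map<String, String> targets = new HashMap<>();
        for (int i = 0; i < list.getLength(); i++) {
            Element rel = (Element) list.item(i);
            targets.put(rel.getAttribute("Id"), rel.getAttribute("Target"));
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        // Sample content modeled on the Relationship element quoted above.
        String sample = "<Relationships xmlns=\"" + RELS_NS + "\">"
            + "<Relationship Id=\"rId4\" Type=\"http://schemas.openxmlformats.org/"
            + "officeDocument/2006/relationships/image\" Target=\"media/image1.jpeg\"/>"
            + "</Relationships>";
        Map<String, String> targets = readTargets(
            new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));
        System.out.println(targets.get("rId4")); // prints media/image1.jpeg
    }
}
```

    The r:embed value read from document.xml is the key; the returned Target is relative to the /word folder.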

    Regards.




  •  08-27-2008, 9:22 AM 3615 in reply to 3528

    Re: How to get images from docx file ( document.xml)

    Thanks, Kirikou.
    I am trying to generate an HTML file that matches the docx content exactly. I am using Java and XSL for this. Everything works as required except the images, which I still cannot place in the HTML. Please help me once again.
    Regards,
    broshni
  •  08-30-2008, 1:30 PM 3622 in reply to 3615

    Re: How to get images from docx file ( document.xml)

    Hi,

    I have implemented this. So if you have not already figured this out, I can tell you how I did it. But I am working with C#, not with Java.

    By the way, if you have designed the XSLT conversion of bullets and numbering, please share it with me, because that is where I am currently stuck.

    /Suralk

  •  09-01-2008, 7:26 AM 3628 in reply to 3622

    Re: How to get images from docx file ( document.xml)

    Hi folks,
    /word/_rels/document.xml.rels contains the relationship for the image I am trying to extract.

    The issue now is resolving the Relationship with Id="rId4", which points to the image part I want to put in the generated HTML.
    Please suggest how to read document.xml.rels so that I can iterate through the file and find the required target.
    Any comment would be highly appreciated.
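
    Since a .docx file is a ZIP package, one way to read document.xml.rels from Java is java.util.zip.ZipFile. The sketch below is only an illustration: the class name DocxPartReader and the method readPart are hypothetical, and the .docx path is taken from the command line.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: reads a single part out of a .docx (ZIP) package.
public class DocxPartReader {
    /** Returns the raw bytes of the named part, e.g. "word/_rels/document.xml.rels". */
    public static byte[] readPart(File docx, String partName) throws IOException {
        try (ZipFile zip = new ZipFile(docx)) {
            ZipEntry entry = zip.getEntry(partName);
            if (entry == null) {
                throw new IOException("Part not found: " + partName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .docx file supplied by the caller.
        byte[] rels = readPart(new File(args[0]), "word/_rels/document.xml.rels");
        System.out.println(new String(rels, StandardCharsets.UTF_8));
    }
}
```

    The returned bytes are ordinary XML, so they can be handed to any XML parser to iterate over the Relationship elements.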
    regards,
    broshni
  •  09-02-2008, 12:25 PM 3636 in reply to 3628

    Re: How to get images from docx file ( document.xml)

    hi,

    You do not need to extract document.xml.rels itself. What you can do is follow the package relationship for the image, write the part it points to into a folder as an image file, and then reference that file from your HTML.

    Are you using C# .NET? If so, I can give you a code snippet to do that.
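
    For Java, the same idea (copy the image part to a folder, then point the HTML at it) might look like the sketch below. The class DocxImageExtractor and the method extractImage are hypothetical names, and the code assumes the relationship Target is an internal path relative to the word/ folder, such as "media/image1.jpeg"; absolute or external targets would need extra handling.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: writes one image part of a .docx to an output folder.
public class DocxImageExtractor {
    /**
     * Copies the image named by relTarget (the Target value from
     * document.xml.rels) into outputDir and returns the written file.
     */
    public static Path extractImage(File docx, String relTarget, Path outputDir)
            throws IOException {
        // Assumption: the target is relative to the word/ folder.
        String partName = "word/" + relTarget;
        try (ZipFile zip = new ZipFile(docx)) {
            ZipEntry entry = zip.getEntry(partName);
            if (entry == null) {
                throw new IOException("Image part not found: " + partName);
            }
            Files.createDirectories(outputDir);
            // Keep only the file name so the image lands directly in outputDir.
            Path out = outputDir.resolve(new File(relTarget).getName());
            try (InputStream in = zip.getInputStream(entry)) {
                Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
            }
            return out;
        }
    }
}
```

    The file name of the returned path can then be passed to the stylesheet (for example as the imgSrc parameter) and used as the src of the img element.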

     

    /Suralk

  •  09-02-2008, 2:45 PM 3638 in reply to 3636

    Re: How to get images from docx file ( document.xml)

    I'm not sure if this is the same for Word '07, but in Word '03 XML many pictures were stored inside <w:binData> with Base64 encoding. To pull these out (in Java), parse the XML document and call these two methods:

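    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.Base64;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // Both methods belong to a class that declares a File field named
    // picDirectory (the folder the images are written to).

    /**
     * Extracts all images from a parsed MS-XML Document
     *
     * @param xmlDoc the MS-XML-Document (before transform)
     */
    public void extractImages(Document xmlDoc)
    {
        NodeList binDataList = xmlDoc.getElementsByTagName("w:binData");

        for (int i = 0; i < binDataList.getLength(); i++)
        {
            Node currentNode = binDataList.item(i);
            if (currentNode.getNodeType() == Node.ELEMENT_NODE && ((Element) currentNode).hasAttribute("w:name"))
            {
                // The w:name attribute looks like "wordml://name.ext"; strip the scheme.
                String fileName = ((Element) currentNode).getAttribute("w:name").replaceFirst("wordml://", "");
                File newImageFile = new File(picDirectory, fileName);
                // Skip images that were already written.
                if (!newImageFile.exists() && writeImage(newImageFile, currentNode))
                {
                    // Print some success message
                }
            }
        }
    }

    /**
     * Writes the content of the w:binData node to the specified file
     *
     * @param toFile File to write bin data to
     * @param binNode the node containing the bin data
     * @return true if the file was written, false on any decoding or I/O error
     */
    public boolean writeImage(File toFile, Node binNode)
    {
        // try-with-resources closes the stream even when decoding fails.
        try (OutputStream dataOut = new FileOutputStream(toFile))
        {
            // getTextContent() returns the Base64 text inside the element
            // (Node.toString() is not guaranteed to return the content).
            // java.util.Base64 (Java 8+) replaces the internal sun.misc.BASE64Decoder;
            // its MIME decoder ignores the line breaks in the encoded text.
            dataOut.write(Base64.getMimeDecoder().decode(binNode.getTextContent()));
            return true;
        }
        catch (IOException | IllegalArgumentException e)
        {
            return false;
        }
    }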

    The above code does not handle .wmz, .emz, and .mso data. These are compressed, but java.util.zip and similar libraries reject the encoding/compression. I have also tried decoding with sun.misc.BASE64Decoder and then decompressing the result, but that fails as well.
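
    One thing that may be worth trying, offered as an untested assumption rather than a confirmed fix: .wmz and .emz files are commonly gzip-compressed WMF/EMF streams, so java.util.zip.GZIPInputStream (rather than the ZIP archive classes) may be the matching decompressor. The class and method names below are hypothetical, and .mso data uses a different container that this sketch does not cover.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for Base64 text that wraps a gzip stream.
public class CompressedMetafile {
    /** Decodes Base64 text, then gunzips the decoded bytes. */
    public static byte[] decodeAndGunzip(String base64Text) throws IOException {
        // The MIME decoder ignores line breaks inside the Base64 text.
        byte[] compressed = Base64.getMimeDecoder().decode(base64Text);
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

    If the payload is not actually gzip data, GZIPInputStream throws an IOException from its constructor, which at least shows quickly whether the assumption holds for a given file.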
  •  09-27-2008, 6:07 PM 3743 in reply to 3638

    Re: How to get images from docx file ( document.xml)

    You can use docx4j to manipulate the images in Java.  docx4j is open source (ASL).  There are several posts on our forums about how to do this.
  •  10-07-2008, 5:07 AM 3766 in reply to 3743

    Re: How to get images from docx file ( document.xml)

    Hi

     

    You can also use Aspose.Words to extract pictures from a Word document.

    http://www.aspose.com/documentation/file-format-components/aspose.words-for-.net-and-java/extract-images-from-a-document.html

    There are .NET and Java versions of this tool.

     

    WBR,

    Alex
