I generate xlsx files from my Java application. I do not use any third-party libraries, because xlsx is just a zipped set of XML files and I only need basic SpreadsheetML functionality. I need to create very large documents (about 350000 rows and 120 columns).

My synthetic tests with fake data work fine, and real data retrieved from the database also works fine for small select queries (about 20000 rows in the result set). But when I try to create a document from a dump of the whole table (about 350000 rows), the document opens with the 'Excel found unreadable content' error.

The first thing I thought of was special characters (such as control characters) in the data. I check for them and replace them with spaces, but the problem has not disappeared. The XML files themselves are valid according to several XML validation tools I used. I also have no business-level information about the string values stored in the database (for example, maximum size, allowed characters, string formatting and so on), and the data strings are too complex to analyze by hand.

So my question is: do you know of any Office Open XML limitations that could cause this problem? The number of cells, the size of a cell value, or something else... I really don't know. Thank you very much in advance.
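For what it's worth, the control-character check I mentioned looks roughly like this (a minimal sketch; the class and method names are just placeholders, and it only enforces the XML 1.0 character range before a value is written into the sheet XML):

    public final class XmlSanitizer {

        // XML 1.0 only allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and
        // #x10000-#x10FFFF; everything else is replaced with a space.
        public static String sanitize(String value) {
            StringBuilder sb = new StringBuilder(value.length());
            for (int i = 0; i < value.length(); ) {
                int cp = value.codePointAt(i);
                boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                        || (cp >= 0x20 && cp <= 0xD7FF)
                        || (cp >= 0xE000 && cp <= 0xFFFD)
                        || (cp >= 0x10000 && cp <= 0x10FFFF);
                sb.appendCodePoint(valid ? cp : ' ');
                i += Character.charCount(cp);
            }
            return sb.toString();
        }
    }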
I would first try to identify whether the problem is caused by the data. Divide the initial data into 20000-row groups and export each group to see whether some fail and some don't. If only some groups have problems, keep subdividing the failing group into smaller sets until you have identified the exact data problem. This can be a bit tedious, but if you keep splitting each failing group into roughly ten smaller ones, you should be able to quickly identify the row, and if necessary the column, that is causing the problem.
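A rough sketch of that bisection, in case it helps; the exportRows and opensCleanly helpers are hypothetical placeholders for your own export and validation code:

    import java.nio.file.Path;

    public class BadRowFinder {

        // Assumes there is exactly one offending row in [from, to); if several
        // rows are bad, repeat the search on the remaining data after fixing
        // the first one you find.
        static long findBadRow(long from, long to) {
            while (to - from > 1) {
                long mid = from + (to - from) / 2;
                if (opensCleanly(exportRows(from, mid))) {
                    from = mid;   // first half is fine, the bad row is in [mid, to)
                } else {
                    to = mid;     // the bad row is in [from, mid)
                }
            }
            return from;          // the row whose data breaks the document
        }

        // Placeholders for your own code: export rows [from, to) to an xlsx file,
        // and check whether the generated file opens without the error.
        static Path exportRows(long from, long to) { throw new UnsupportedOperationException(); }
        static boolean opensCleanly(Path file)     { throw new UnsupportedOperationException(); }
    }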
If all the smaller groups work fine, then it is unlikely to be a specific data problem, and the question becomes why the large fake-data document works when the real-data document does not. In that case I would want to know how the fake data compares to the real data. Are there longer text strings in the real data, for example? If some size threshold is causing the issue, you can scale the export back in increments to see whether there is a particular point at which it starts working. For example, try limiting the export to 300000 rows, then 200000; if 300000 fails but 200000 works, try 250000, and so on. Then repeat with a different set of rows and see whether it hits the same threshold. Note that this threshold approach is only meaningful once you have ruled out a data problem; otherwise a smaller export may succeed simply because the bad data happens to be left out.
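If longer strings turn out to be the difference, a quick scan of the real data can tell you the longest value in each column before you start re-exporting; one documented limit worth ruling out is Excel's maximum of 32,767 characters per cell. A sketch, assuming the data comes from a JDBC ResultSet (method and class names are just examples):

    import java.sql.*;

    public class ColumnLengthScanner {

        // Prints the longest string value found in each column of the given table,
        // so the real data can be compared with the fake test data (and with
        // Excel's 32,767-character-per-cell limit).
        static void reportMaxLengths(Connection conn, String table) throws SQLException {
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT * FROM " + table)) {
                ResultSetMetaData md = rs.getMetaData();
                int cols = md.getColumnCount();
                int[] max = new int[cols];
                while (rs.next()) {
                    for (int c = 1; c <= cols; c++) {
                        String v = rs.getString(c);
                        if (v != null) max[c - 1] = Math.max(max[c - 1], v.length());
                    }
                }
                for (int c = 0; c < cols; c++) {
                    System.out.println(md.getColumnName(c + 1) + ": " + max[c]);
                }
            }
        }
    }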
bobm, yes, I had the same idea and I am working that way now, but in parallel I am still looking for a more elegant solution.