As I have said many times in the past, I used to write document conversion tools. I believe this gives me a valid basis to comment on the ODF/OOXML debate that is raging at the moment. If these types of questions interest you, have a look at the book I talk about later (In Search of Stupidity: Over 20 Years of High-Tech Marketing Disasters (UK) or here for US link).
Let's start with some history about the Office 97-2003 file formats. Joel was writing about this today (Why are the Microsoft Office file formats so complicated? (And some workarounds) - Joel on Software). Note that part of what he comments on is the fact that the documentation for the binary file formats is now available from Microsoft:
Why are the Microsoft Office file formats so complicated? (And some workarounds)
This item ran on the Joel on Software homepage on Tuesday, February 19, 2008
Last week, Microsoft published the binary file formats for Office. These formats appear to be almost completely insane. The Excel 97-2003 file format is a 349 page PDF file.
If you started reading these documents with the hope of spending a weekend writing some spiffy code that imports Word documents into your blog system, or creates Excel-formatted spreadsheets with your personal finance data, the complexity and length of the spec probably cured you of that desire pretty darn quickly. A normal programmer would conclude that Office’s binary file formats:
- are deliberately obfuscated
- are the product of a demented Borg mind
- were created by insanely bad programmers
- and are impossible to read or create correctly.
You’d be wrong on all four counts. With a little bit of digging, I’ll show you how those file formats got so unbelievably complicated, why it doesn’t reflect bad programming on Microsoft’s part, and what you can do to work around it.
The first thing to understand is that the binary file formats were designed with very different design goals than, say, HTML.
They were designed to be fast on very old computers. For the early versions of Excel for Windows, 1 MB of RAM was a reasonable amount of memory, and an 80386 at 20 MHz had to be able to run Excel comfortably. There are a lot of optimizations in the file formats that are intended to make opening and saving files much faster:
- These are binary formats, so loading a record is usually a matter of just copying (blitting) a range of bytes from disk to memory, where you end up with a C data structure you can use. There’s no lexing or parsing involved in loading a file. Lexing and parsing are orders of magnitude slower than blitting.
- The file format is contorted, where necessary, to make common operations fast. For example, Excel 95 and 97 have something called “Simple Save” which they use sometimes as a faster variation on the OLE compound document format, which just wasn’t fast enough for mainstream use. Word had something called Fast Save. To save a long document quickly, 14 out of 15 times, only the changes are appended to the end of the file, instead of rewriting the whole document from scratch. On the hard drives of the day, this meant saving a long document took one second instead of thirty. (It also meant that deleted data in a document was still in the file. This turned out to be not what people wanted.)
They have to reflect all the complexity of the applications. Every checkbox, every formatting option, and every feature in Microsoft Office has to be represented in file formats somewhere. That checkbox in Word’s paragraph menu called “Keep With Next” that causes a paragraph to be moved to the next page if necessary so that it’s on the same page as the paragraph after it? That has to be in the file format. And that means if you want to implement a perfect Word clone than can correctly read Word documents, you have to implement that feature. If you’re creating a competitive word processor that has to load Word documents, it may only take you a minute to write the code to load that bit from the file format, but it might take you weeks to change your page layout algorithm to accommodate it. If you don’t, customers will open their Word files in your clone and all the pages will be messed up.
<Joel continues with lots of good information at Why are the Microsoft Office file formats so complicated? (And some workarounds) - Joel on Software>
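To make Joel's point about blitting a bit more concrete, here is a minimal Python sketch of loading fixed-layout binary records. The record layout (`CELL`) and the function names are invented for illustration and are not the real Excel layout; the point is only that reading is a fixed-size copy per record, with no lexing or parsing step.

```python
import struct

# Hypothetical fixed-layout cell record: row (uint16), column (uint16),
# value (float64). Invented for illustration; NOT the real Excel record layout.
CELL = struct.Struct("<HHd")


def save_cells(cells):
    """Pack each (row, col, value) tuple into fixed-size bytes, back to back."""
    return b"".join(CELL.pack(row, col, value) for row, col, value in cells)


def load_cells(blob):
    """Reinterpret the bytes directly as records; no tokenizing or parsing."""
    return [
        CELL.unpack_from(blob, offset)
        for offset in range(0, len(blob), CELL.size)
    ]


cells = [(0, 0, 1.5), (0, 1, 2.25), (1, 0, -3.0)]
blob = save_cells(cells)
assert load_cells(blob) == cells
```

A text format carrying the same data would need to be tokenized and converted field by field, which is the cost Joel says the binary formats were designed to avoid.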
Now that we have some history and perspective, let's look at the future. People ask why Microsoft can't just use ODF. Well, to start with the really, really, really simple answer: Joel stated "They have to reflect all the complexity of the applications", and that is something ODF falls HUGELY short of. There are so many things that ODF does not do, or limits, that you would have to massively embrace and extend the standard, which has been stated as undesirable many times.
So, what are Microsoft's options?
- Create a new standard that enables all current functionality and allows for future expansion
- Not bother sharing its file formats
- Take ODF and lose much of the Office functionality
- Take ODF and spend years adding to it until it enables current Office functionality. Remember that changes have to be made by committee, so "Office does it" is not a good enough argument for the standard to be changed
If it were your software company, which would you choose in today's climate? As a user, sharing files is about all you care about, in which case option 2 is fine if everyone is using Office from Microsoft. If not, then I guess it would have to be option 1.
Of course, there are those who seem to think that Microsoft wanted to tie people in. Vijay says as much. But if that were true, why did Microsoft remove the biggest incentive to upgrade from one version of Office to the next ("upgrade or your files will not be compatible") when they made Office 97 the first of a series of products that could share files without any requirement for the user to upgrade? Automated conversion tools were available to let you bulk move files between vendors (I wrote some of them), and the documentation about the binary format was sufficient for me to do my job. Nope, Microsoft decided on the interoperability path before they released Office 97. XML was not an option then, so the shell approach they chose was a good one: each version of Office adds another layer to the shell of the Office files, and each version just reads up to the point it no longer understands.
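The shell idea can be sketched in a few lines of Python. This is only an illustration of the principle as I have described it, not the actual Office binary layout: the header format (`HEADER`), the version numbers, and the function names are all made up for the example.

```python
import struct

# Hypothetical layer header: version tag (uint16) + payload length (uint32).
# Invented for illustration; NOT the real Office file structure.
HEADER = struct.Struct("<HI")


def write_layers(layers):
    """Serialize (version, payload) pairs, oldest layer first."""
    return b"".join(
        HEADER.pack(version, len(payload)) + payload
        for version, payload in layers
    )


def read_layers(blob, reader_version):
    """Return the layers this reader understands.

    Reading stops at the first layer written by a newer version,
    so an older application simply ignores what it cannot interpret.
    """
    understood = []
    offset = 0
    while offset < len(blob):
        version, length = HEADER.unpack_from(blob, offset)
        if version > reader_version:
            break  # newer layer: stop here, keep what we have
        offset += HEADER.size
        understood.append((version, blob[offset:offset + length]))
        offset += length
    return understood


blob = write_layers([(97, b"text"), (2000, b"extra"), (2003, b"more")])
assert read_layers(blob, 97) == [(97, b"text")]
assert len(read_layers(blob, 2003)) == 3
```

The older reader loses the newer features but still opens the file, which is exactly why nobody was forced to upgrade just to share documents.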
Now let's talk about organic growth. The Office binary file format is the product of organic growth. Product managers only have so much insight into the architecture that will be needed in 2, 3, or 5 years' time, so things end up getting bolted on, and the format becomes less architecturally pure with each addition. The alternative is to re-architect and rewrite. This is discussed extensively in the book In Search of Stupidity: Over 20 Years of High-Tech Marketing Disasters (UK) or here for US link, where they point out that all this growth has known attributes: the code has hundreds or thousands of bug fixes in it to work with the file formats. Start from scratch and you in effect lose all those years of accumulated bug fixes and knowledge, so your next product will be buggy. Read the book and look at the chapter on Netscape. It suggests that this is what killed them!
OK, so that is the history lesson, and the fact that the newer formats are all documented should please many.
Tue, Feb 19 2008 7:06 PM