David Overton's Blog and Discussion Site
This site is my way to share my views and general business and IT information with you about Microsoft, IT solutions for ISVs, technologists and businesses, large and small. I specialise in Windows Intune and SBS 2008.
This blog is purely the personal opinions of David Overton. If you can't find the information you were looking for e-mail me at admin@davidoverton.com.

Office from the past, ODF and OOXML (Office of today and tomorrow) and why is organic growth nearly always bad for software and why re-writing is not good either

As I have said many times in the past, I used to write document conversion tools.  I believe this gives me a valid reason to comment on the ODF/OOXML debate that is raging at the moment.  If these types of questions interest you, have a look at the book I talk about later (In Search of Stupidity: Over 20 Years of High-Tech Marketing Disasters).

Let's start with some history about the Office 97-2003 file formats.  Joel was writing about this today (Why are the Microsoft Office file formats so complicated? (And some workarounds) - Joel on Software).  Note that part of what he comments on is that the documentation for the binary file format is now available from Microsoft:

Why are the Microsoft Office file formats so complicated? (And some workarounds)

This item ran on the Joel on Software homepage on Tuesday, February 19, 2008

Last week, Microsoft published the binary file formats for Office. These formats appear to be almost completely insane. The Excel 97-2003 file format is a 349 page PDF file. 

<snipped>

If you started reading these documents with the hope of spending a weekend writing some spiffy code that imports Word documents into your blog system, or creates Excel-formatted spreadsheets with your personal finance data, the complexity and length of the spec probably cured you of that desire pretty darn quickly. A normal programmer would conclude that Office’s binary file formats:

  • are deliberately obfuscated
  • are the product of a demented Borg mind
  • were created by insanely bad programmers
  • and are impossible to read or create correctly.

You’d be wrong on all four counts. With a little bit of digging, I’ll show you how those file formats got so unbelievably complicated, why it doesn’t reflect bad programming on Microsoft’s part, and what you can do to work around it.

The first thing to understand is that the binary file formats were designed with very different design goals than, say, HTML.

They were designed to be fast on very old computers. For the early versions of Excel for Windows, 1 MB of RAM was a reasonable amount of memory, and an 80386 at 20 MHz had to be able to run Excel comfortably. There are a lot of optimizations in the file formats that are intended to make opening and saving files much faster:

  • These are binary formats, so loading a record is usually a matter of just copying (blitting) a range of bytes from disk to memory, where you end up with a C data structure you can use. There’s no lexing or parsing involved in loading a file. Lexing and parsing are orders of magnitude slower than blitting.
  • The file format is contorted, where necessary, to make common operations fast. For example, Excel 95 and 97 have something called “Simple Save” which they use sometimes as a faster variation on the OLE compound document format, which just wasn’t fast enough for mainstream use. Word had something called Fast Save. To save a long document quickly, 14 out of 15 times, only the changes are appended to the end of the file, instead of rewriting the whole document from scratch. On the hard drives of the day, this meant saving a long document took one second instead of thirty. (It also meant that deleted data in a document was still in the file. This turned out to be not what people wanted.)

<snipped>

They have to reflect all the complexity of the applications. Every checkbox, every formatting option, and every feature in Microsoft Office has to be represented in file formats somewhere. That checkbox in Word’s paragraph menu called “Keep With Next” that causes a paragraph to be moved to the next page if necessary so that it’s on the same page as the paragraph after it? That has to be in the file format. And that means if you want to implement a perfect Word clone than can correctly read Word documents, you have to implement that feature. If you’re creating a competitive word processor that has to load Word documents, it may only take you a minute to write the code to load that bit from the file format, but it might take you weeks to change your page layout algorithm to accommodate it. If you don’t, customers will open their Word files in your clone and all the pages will be messed up.

<Joel continues with lots of good information at Why are the Microsoft Office file formats so complicated? (And some workarounds) - Joel on Software>
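
To make Joel's point about blitting versus parsing a little more concrete, here is a minimal Python sketch. The record layout (row, column, value) and the text markup are invented purely for illustration and are not the real Excel format; the point is only that the binary load is a single fixed-size unpack, while the text load has to split and convert every field.

```python
import struct

# Hypothetical fixed-layout record: row (uint16), column (uint16), value (float64).
# This layout is an assumption for illustration, not the real Excel record format.
CELL = struct.Struct("<HHd")

def save_cell(row, col, value):
    """Pack a cell into its fixed 12-byte binary form."""
    return CELL.pack(row, col, value)

def load_cell(data):
    """Load a cell by unpacking the bytes directly, with no lexing or parsing."""
    return CELL.unpack(data)

def load_cell_from_text(text):
    """The text-format alternative: split the markup and convert each field."""
    parts = text.strip("<>/ ").split()[1:]  # drop the element name
    fields = dict(part.split("=") for part in parts)
    return int(fields["row"]), int(fields["col"]), float(fields["value"])

blob = save_cell(3, 7, 42.5)
binary_cell = load_cell(blob)
text_cell = load_cell_from_text("<cell row=3 col=7 value=42.5 />")
```

Both loaders return the same cell, but the binary one does no string handling at all, which mattered a great deal on a 20 MHz 80386.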

Now that we have some history and perspective, let's look at the future.  People ask why Microsoft can't just use ODF.  Well, if we want to start with the really, really, really simple answer, Joel stated "They have to reflect all the complexity of the applications", which is something that ODF falls HUGELY short of.  There are so many things that ODF either does not do or limits that Microsoft would have to massively embrace and extend the standard, which has been stated as undesirable many times.

So, what are Microsoft's options?

  1. Create a new standard that does enable all current functionality and allow for future expansion
  2. Not bother sharing its file formats
  3. Take ODF and lose much of the Office functionality
  4. Take ODF and spend years adding to it until it enables current Office functionality. Remember that changes have to be made by committee, so "because Office does it" is not a good enough argument for the standard to be changed

If it was your software company, which would you choose in today's climate?  As a user, sharing files is about all you care about, in which case, option 2 is fine if everyone is using Office from Microsoft.  If not, then I guess it would have to be option 1.

Of course, there are those who seem to think that Microsoft wanted to tie people in - Vijay says as much. But if that were true, why did Microsoft remove the biggest incentive to upgrade from one version of Office to the next (upgrade or your files will not be compatible) when they made Office 97 the first of a series of products that could share files without any requirement for the user to upgrade?  Automated conversion tools were available to enable you to bulk move files between vendors (I wrote some of them) and the documentation about the binary format was sufficient to do my job.  Nope, Microsoft decided on the interoperability path before they released Office 97.  XML was not an option then, so the shell approach they chose was a good option: each version of Office adds another layer to the shell of the Office files, and each version just reads up to the point it no longer understands.
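
The "read up to the point it no longer understands" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the layer header (a version number plus a payload length), the function names and the version labels are all hypothetical, and this is not the actual Office binary layout.

```python
import struct

# Hypothetical layer header: the version that introduced the layer (uint16)
# followed by the payload length in bytes (uint32). Not the real Office layout.
HEADER = struct.Struct("<HI")

def write_layers(layers):
    """Serialise (version, payload) pairs in ascending version order."""
    out = bytearray()
    for version, payload in sorted(layers, key=lambda layer: layer[0]):
        out += HEADER.pack(version, len(payload)) + payload
    return bytes(out)

def read_layers(data, reader_version):
    """Return the layers this reader understands, stopping at the first newer one."""
    understood = []
    offset = 0
    while offset + HEADER.size <= len(data):
        version, length = HEADER.unpack_from(data, offset)
        if version > reader_version:
            break  # written by a newer release: stop reading here
        offset += HEADER.size
        understood.append((version, data[offset:offset + length]))
        offset += length
    return understood

doc = write_layers([
    (97, b"text and basic formatting"),
    (2000, b"newer feature"),
    (2003, b"newest feature"),
])
old_view = read_layers(doc, 97)    # an older reader sees only the first layer
new_view = read_layers(doc, 2003)  # the newest reader sees all three layers
```

The older reader still opens the file and gets everything it knows how to handle, which is the property that removes the need to upgrade just to share documents.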

Now let's talk about organic growth.  The Office binary file format is based on organic growth.  Product managers only have so much insight into the architecture that will be needed in 2, 3 or 5 years' time, so things end up getting bolted on and the design becomes less architecturally pure with each addition.  The alternative is to re-architect and re-write.  This is discussed extensively in the book In Search of Stupidity: Over 20 Years of High-Tech Marketing Disasters, which points out that organically grown code is a known quantity: it has hundreds or thousands of bug fixes in it to work with the file formats.  Start from scratch and you in effect lose all those years of accumulated bug fixes and knowledge, so your next product will be buggy.  Read the book and look at the chapter on Netscape - it suggests that this is what killed them!

 

OK, so that is the history lesson, and the fact that the newer formats are all documented should please many :)

 

ttfn

 

David


Posted Tue, Feb 19 2008 7:06 PM by David Overton

Comments

Vijay Singh Riyait wrote re: Office from the past, ODF and OOXML (Office of today and tomorrow) and why is organic growth nearly always bad for software and why re-writing is not good either
on Tue, Feb 19 2008 10:19 PM

Dave, when you explain it, it all sounds so reasonable! I can accept that the MS standard is superior and that we should all adopt it, but this constant sniping at each other's standards doesn't serve the interest of customers. I also pointed out (which you didn't highlight) the stupidity of trying to stop MS getting it accepted as an ISO standard. Microsoft has been found guilty of anti-competitive behaviour by both the US Government and the EC through due process. I believe Microsoft is learning from these experiences and changing, but it's a big beast to change and maybe what it needs is people like yourself to educate your colleagues about interoperability and openness.

David Overton wrote re: Office from the past, ODF and OOXML (Office of today and tomorrow) and why is organic growth nearly always bad for software and why re-writing is not good either
on Tue, Feb 19 2008 11:06 PM

Vijay,

Of course I make it sound so reasonable ;).  This is a dance and unfortunately this is the way it goes.  I don't think Microsoft is scared any more than any business is of losing business.  However, I think we need to stop bandying around terms like "guilty of anti-competitive behaviour" when in each case it is a very specific section of the business, rather than the whole company.  I could go into Windows N, but I don't think now is the time.

As for XML, look at your history - Microsoft moved to a common document format in 1997 and still supports many document formats in our products.  Microsoft moved to XML as a strategy in 2000 and pioneered the industry move to Web Services and XML.  Remember "dot NOT".  That was the reaction of the people you are upholding as leaders to XML based solutions.

Don't get suckered (by any side) into thinking this game is ALWAYS about winning business and nothing else.  IBM still creates more patents per year than any other company, and it only allows people it does not consider a threat to use its IP, and so on.  Do you think IBM didn't know they were going to re-release Symphony based on ODF when they were campaigning for ODF?  They were really just shouting "use my file format, not theirs".

I did re-read your blog post and I don't see you defending MS on this, but that is the joy of the written and unambiguous language that English is :)

ttfn


(c)David Overton 2006-13