[Microsoft's New Win-Win Strategy: Post 2 of 5]
As a successful software architect, I've learned to recognize the results of a poorly managed design process. In Windows Vista: Past Its Due Date Already, I gave some insight into why Windows matches a "software implosion pattern" I've seen before. This post explores why Microsoft really should scrap the codebase; the next post will make the controversial suggestion that Microsoft scrap Windows in favor of Linux.
That the codebase is in trouble is no secret; for the most part, even Microsoft agrees. Microsoft has a team called "The Windows Code Excellence Team" with a program for "driving broad changes efficiently into the Windows codebase". They call these "Strike Force Efforts", and the very name they chose reveals the adversarial relationship that even Microsoft insiders perceive with their own codebase.
But still, Microsoft thinks they can fix it. This post is about why they can't.
The Recipe Was Wrong From The Start
The biggest single reason Microsoft can't fix the problem is that Windows architecture is flawed at its very foundation. They're sinking money into repairs on an old car, trying to make it fuel-efficient, trying to make it conform to current needs. But as we all know, sometimes you need to buy a new one.
Understanding why this is true is technically challenging for most people. The real issues are understood only by well-educated computer scientists. As a result, most common discussions of Windows architecture focus on the visible aspects of Windows rather than the technical underpinnings. And because the technical world is so full of geeks who argue pointlessly (a la Slashdot), there are rarely any efforts to bring clarity to Windows' most severe problem.
There are ample sources of technical information about this. While it may seem like a Windows vs. Mac argument, Daniel Eran's excellent (if lengthy) article "Five Architectural Flaws in Windows Solved in Mac OS X" is one of the best examples. What can easily be missed about this article is that 4 out of 5 points Dan makes are directly the result of Apple's decision to use Unix as an architectural basis for OS X. If Dan were talking about MacOS 9, only the first of his five arguments would apply.
So, by adopting Unix, Apple was able to push their own technology into the next generation.
Both anecdotal evidence and qualified research are available to illustrate the problem. Security researchers can point squarely at flawed API design as a primary reason why Windows architecture actually "encourages insecure applications". Those who remember Frederick Brooks's classic book "The Mythical Man-Month" see in Vista the same trip-ups Brooks warned about.
Because Windows architecture has fundamental design flaws, Microsoft is constantly adding layer after layer of new technology to achieve what would be implicit in a better architecture. This increases the amount of work, technology, and conceptual baggage that developers must learn and master, and it decreases the overall reliability of applications. Developers have a limited amount of mental energy. Any Windows developer knows that negotiating Windows "idiosyncrasies" takes almost as much time as the application itself.
The idiosyncrasies themselves become entire "worlds of new technology." Knowing how the Windows event loop operates, and arcane topics such as using "assemblies" to avoid "DLL Hell", become distinct new art forms. The Windows community has become so accustomed to this continual "band-aid" approach that they no longer recognize the original problem. Instead, the new technologies become job skills, and further serve to separate Windows programming expertise and culture from the larger world of shared computer science knowledge. (See an earlier post of mine for more on this "disconnect".)
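To make this concrete, here is the canonical Win32 message loop in minimal form -- the first of these art forms every Windows developer internalizes. This is only a sketch; the window-class registration and window procedure that normally surround it are omitted.

```c
/* The canonical Win32 message loop, shown in minimal form. In a real
 * program this sits inside WinMain, after a window class has been
 * registered and a window created; that boilerplate is omitted here. */
#include <windows.h>

void run_message_loop(void)
{
    MSG msg;
    /* Every GUI thread pumps messages like this. The application's
     * actual logic lives in window procedures that DispatchMessage
     * invokes indirectly -- control flow is inverted from the start. */
    while (GetMessage(&msg, NULL, 0, 0) > 0) {
        TranslateMessage(&msg);  /* turn raw key events into characters */
        DispatchMessage(&msg);   /* route the message to its window proc */
    }
}
```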
Now and then, some Windows developer asks the right questions (from the InfoWorld gripeline):
I am surprised that still, after all these years, Windows has not seen the solution that UNIX (and probably many OSes) takes to DLL Hell -- use versioned DLL files so something linked against an old DLL will use the old one while something linked against the new one will use the new one. Voilà. Problem solved.
This continual layering of technology to solve architectural problems leads to the next problem with the Windows codebase ...
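For contrast, here is a minimal sketch of what that developer is describing. On a Unix system, two versions of the same shared library coexist under versioned names, and each program loads the version it was built against. The library name libfoo here is hypothetical.

```c
/* Minimal sketch of Unix-style library versioning. "libfoo" is a
 * hypothetical library: an old program resolves libfoo.so.1 while a
 * newer one resolves libfoo.so.2, and the two versions coexist on
 * disk without clobbering each other. Build with -ldl. */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *v1 = dlopen("libfoo.so.1", RTLD_NOW);  /* what old apps link against */
    void *v2 = dlopen("libfoo.so.2", RTLD_NOW);  /* what new apps link against */

    printf("libfoo.so.1: %s\n", v1 ? "loaded" : "not installed");
    printf("libfoo.so.2: %s\n", v2 ? "loaded" : "not installed");

    if (v1) dlclose(v1);
    if (v2) dlclose(v2);
    return 0;
}
```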
Escalating Complexity
In a recent New York Times article, the following appeared:
Several thousand engineers have labored to build and test Windows Vista, a sprawling, complex software construction project with 50 million lines of code, or more than 40 percent larger than Windows XP.
Windows is growing beyond even Microsoft's ability to manage it. In October 2004, Martin Taylor, then Microsoft's GM of Platform Strategy, admitted that changes were needed and introduced a new "role-based" strategy for reducing the size of the codebase:
"Today, it's still the entire code base. There's no reduction in the bits you get; things are just roped off," Taylor said Friday. "We want to get to a model of role-based deployment where you might just have the bits you need for that function. ... It's one of our design goals for Longhorn."
In that ComputerWorld article, Taylor was talking about a new agenda for Longhorn to trim the size of the codebase. The article later says:
A Microsoft spokeswoman confirmed that the goals of providing a smaller Windows "footprint" are to cut maintenance costs and provide a "reduced surface attack area."
Martin is now part of the Windows Live team. The new GM of Platform Strategy, Bill Hilf, hasn't said a word about it in the last 18 months. So much for trying. If a smaller codebase was one of the design goals for Longhorn, it seems Microsoft has decided to put the idea (and Martin Taylor) on the back burner for now.
Windows 3.1 had 2.5 million lines of code, Windows 95 had 15 million, XP has 40 million, and Vista will have at least 50 million. Microsoft managers are stumped about how to reduce the complexity and size, and Michael Cherry, former Microsoft product manager, says "It's such a collection of smart people that they've started to believe too much in themselves". (MercuryNews)
Not only is the problem growing, but the team is looking the other way.
Reinventing Microsoft Software Culture
It's not enough to throw out the codebase. Here's the real challenge: You need to retool the very culture which allowed it to happen. Before starting over, you need to figure out what (mis-)management style allowed Windows XP's excellent security foundation to be completely disabled by other arms of the organization. Microsoft already tried to re-invent Windows once with Windows NT. The problem is, while one set of OS experts in Microsoft is devising an excellent security framework, another set of "experts" is violating all the rules in the interest of "dumbing down the features" for users. Security guru Steve Gibson (quoted in WindowsITPro) explains the phenomenon:
"With Windows 2000 you could argue that Microsoft was at least preserving the original NT security model," Gibson noted. "Regular users would log on as Administrator only when doing system tasks like installing applications or bug fixes, and then log on as a regular user to get work done. This is much like a UNIX machine, where the root account is tweaked very carefully, not generally used for day-to-day work. But Microsoft moved that NT security model to the home and gave Administrator power to users. [The company] discarded the traditional security model because it was too hard to explain to users..."
Aaron Margosis (a Microsoft employee) has an entire blog dedicated to helping people run Windows as "Non-Admin". Despite his excellent advice, it is impossible for the average user to comprehend, much less follow, such instructions when all of Windows' default settings, and the expectations of third-party software, assume that average users will be running as Administrator. As a result, Microsoft's excellent security model lies in the background gathering dust, while clever hackers throughout the world know that it's open season on the average user's desktop. None of this is accidental; it is engineered into the product, introduced by well-meaning product managers attempting to make things simpler while people elsewhere in the organization know better.
It is as if Microsoft is releasing products by "trial and error". They even recognize the "non-admin" problem and are adding yet another layer of complexity to Vista, called User Account Control, to help solve it. But even on the second attempt, it's not looking good. Beta users are annoyed and claim it is far too complex and intrusive:
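The Unix discipline Gibson describes above can be expressed in a few lines of C. This is only an illustrative sketch: it assumes a program that needs root for a single startup task, and uid 1000 stands in for an arbitrary ordinary user account.

```c
/* Illustrative sketch of the Unix privilege model: start as root,
 * do the one task that needs it, then irrevocably drop to an
 * ordinary account before doing everything else. The uid 1000 is
 * an arbitrary example of an unprivileged user. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* ... perform the single privileged startup task here ... */

    if (setuid(1000) != 0) {   /* permanently drop root */
        perror("setuid");
        return EXIT_FAILURE;
    }

    /* From here on the process cannot regain root, so a bug in the
     * day-to-day code below cannot compromise the whole machine. */
    printf("now running as uid %d\n", (int)getuid());
    return EXIT_SUCCESS;
}
```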
In its current incarnation, too many people are likely to dismiss [User Account Control] completely, and if that happens, everyone loses. [Ed Bott -ZDNet]
That Microsoft has launched a major product with a major, self-introduced security flaw is the most brazen sort of incompetence. That they are still not getting it right reveals something much worse: ignorance. While scrapping the codebase is essential, it is equally essential to establish new rules when moving forward.
Margosis's final paragraph in one of his articles is among the most interesting. Probably the biggest reason why Windows XP is so vulnerable, and why so many people run as root, is cultural:
Hey, y’all! We need to lead by example. People look to us for best practices, for the right way to do things. We are trying to convince the world that we are thought leaders in software and in software security. In the Unix world, they never run as root except when necessary. They “su”, do what they need to do, and revert back. We are not leaders when we run as root all the time. Comrades: you need to run as “User”, and your customers need to see you doing it. If you run into issues, don’t add yourself back to the admins group – file a bug against the offending product. Customers: if you see any MS sales, MCS, Premier, PSS, etc., doing web or email as admin, please tell them, “You’re not setting a very good example. I am disappointed.”
Spaghetti
While some experts treat the flaws above as established fact, I suspect more people in the industry consider them quite subjective. As we all know, a picture is worth a thousand words. In April, a ZDNet article with the dubious title "Why Windows is less secure than Linux" included two diagrams generated by Sana Security, shown below. The diagrams received little attention, but they compare graphically how Windows and Linux go about serving a single web page.
The first diagram illustrates how Windows processes the page:
The second illustrates how Linux processes the page:
The orderly arrangement of the Linux traces is no accident. It is the result of years of thinking that goes back to the origins of Unix itself. Good operating system design ensures that the operating system and application layers are distinct, separated by boundaries like immovable walls. Such walls manage the complexity of systems by isolating operations from one another to minimize dependencies. To a good system designer, these are not just guidelines; they are dogma.
To any good systems architect, the traces in the Windows diagram are like a giant black spot on an MRI. They represent an undisciplined and haphazard set of interrelationships, the result of years of unsystematic development and of supporting legacy code and processes.
Little Hope of Repair
There is hardly any hope of repair for such systems. The new Windows Vista may eventually work, but it will be by brute-force testing and a bit of sleight of hand, not by good design. Microsoft, however, is trying to fix it. On his blog, Microsoft employee Larry Osterman describes the problem:
As systems get older, and as features get added, systems grow more complex. The operating system (or database, or whatever) that started out as a 100,000 line of code paragon of elegant design slowly turns into fifty million lines of code that have a distinct resemblance to a really big plate of spaghetti.
This isn't something specific to Windows, or Microsoft, it's a fundamental principle of software engineering. The only way to avoid it is extreme diligence - you have to be 100% committed to ensuring that your architecture remains pure forever.
It's no secret that regardless of how architecturally pure the Windows codebase was originally, lots of spaghetti-like issues have crept into the product over time.
Larry's right about the problem. But unfortunately there is no way to turn back the hands of time and retroactively ensure that the architecture was pure from the start. Still, he goes on to describe how Microsoft has developed internal tools "that perform static analysis of the windows binaries and they work out the architectural and engineering dependencies between various system components". The hope is that by knowing which layers should be isolated and why, changes can be put in place that fix the problem and eliminate the spaghetti.
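Microsoft's internal tools are not public, but the core idea of a layering check is simple enough to sketch. In this toy version, every component is assigned a layer number by hand, and any dependency that points from a lower layer up to a higher one is flagged. The component names and layer assignments are invented for illustration.

```c
/* Toy sketch of a layering check. Each component gets a hand-assigned
 * layer number (0 = lowest); a dependency from a lower layer up to a
 * higher one is a violation. Names and layers here are invented. */
#include <stdio.h>

struct dep {
    const char *from, *to;
    int from_layer, to_layer;
};

int main(void)
{
    struct dep deps[] = {
        { "shell",  "gui",    3, 2 },  /* fine: higher depends on lower */
        { "kernel", "shell",  0, 3 },  /* violation: kernel reaching up  */
        { "gui",    "kernel", 2, 0 },  /* fine                           */
    };
    int n = sizeof deps / sizeof deps[0];

    for (int i = 0; i < n; i++)
        if (deps[i].from_layer < deps[i].to_layer)
            printf("layering violation: %s (layer %d) depends on %s (layer %d)\n",
                   deps[i].from, deps[i].from_layer,
                   deps[i].to,   deps[i].to_layer);
    return 0;
}
```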
Then, unfortunately, Larry goes into the tall grass when he says:
Well, most of the layering issues can be resolved via email, but for some set of issues, email just doesn't work - you need to get in front of people with a whiteboard so you can draw pretty pictures and explain what's going on.
Software architecture may be an interesting thing to talk about in email or on the whiteboard. But such naive attempts will not make the sweeping architectural changes necessary to yield noticeable improvement. Only good design, enforced by software tools and disciplined coding practices, can produce well-layered systems with manageable complexity. Much of the Windows code itself predates the very tools and practices needed to fix it. For the fundamental design of Windows to change, you need to go back to the drawing board.
Even if you believed, for a moment, that you could check every line of code and fix every dependency, the math would get you. As the number of function points increases, the number of potential side-effects and dependencies grows combinatorially. Even a small software system can have millions of interdependent relationships. A large system like Vista, with 50 million lines of code, would have side-effects and causal relationships that defy analysis.
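A quick back-of-the-envelope calculation shows why. If any two of n function points can potentially interact, the pairwise relationships alone number n(n-1)/2, before counting any higher-order effects. The sample sizes below are purely illustrative:

```c
/* Back-of-the-envelope arithmetic: with n function points there are
 * n*(n-1)/2 potential pairwise interactions -- before counting any
 * higher-order effects. The sample sizes are purely illustrative. */
#include <stdio.h>

int main(void)
{
    long long sizes[] = { 1000LL, 100000LL, 10000000LL };
    int n = sizeof sizes / sizeof sizes[0];

    for (int i = 0; i < n; i++)
        printf("%10lld function points -> %20lld potential pairs\n",
               sizes[i], sizes[i] * (sizes[i] - 1) / 2);
    return 0;
}
```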
Confidence Building
In March, when Vista slipped, article after article appeared about the whys and whens of the slip. Popular journalism quickly took an editorial stance. The New York Times article "Burden of the years weighs on Windows" set the stage:
"Windows is now so big and onerous because of the size of its code base, the size of its ecosystem and its insistence on compatibility with the legacy hardware and software, that it just slows everything down," noted David Yoffie, a professor at Harvard Business School. "That's why a company like Apple has such an easier time of innovation."
Microsoft was uncharacteristically silent. I believe, finally, there could be no disagreement.
Whether or not the problem is as egregious as I say, there is certainly a belief that Windows has reached a turning point in its lifecycle. Consumer confidence in Windows has waned, and the recognition that the underlying operating system is to blame is becoming widely accepted.
If, as part of a bold new strategy, Microsoft announced that the Vista codebase was the end of the line, confidence in future solutions could finally increase. Instead of fighting the past, the talented teams of Microsoft engineers would be learning from their mistakes. As it is, there is far too much code and far too many problems for them to do anything other than trudge forward, making it work as best they can.
Conclusions
The Windows codebase is in bad shape. It's unlikely that Microsoft, or indeed anyone, can fix it. I am certain they will create a usable version of Vista. But, I expect that one year after its release, we will not be looking back, happy that the problems are solved. Instead, such an albatross of design can only yield new problems, and new challenges for Microsoft.
Get rid of it, and replace it. But with what? In the next post (coming in a couple days), I'll suggest that Linux be a part of a new strategy to revitalize the product line. It's good for Microsoft in more ways than you think.
Wow. At first I thought this was a joke, or maybe some delusional ramblings (like Dvorak), but you are actually serious about this.
First of all, the basics:
* Windows isn't popular because it's cool; it's popular because of the millions of apps that run on it. Any replacement for Windows would either have to run the millions of apps that already exist or would have to be cool enough to get a half-billion people to be willing to replace all their apps with new ones.
* Windows has about a half-billion users, most of whom are largely computer illiterate and administer their own computer. Even if there are 100,000,000 Windows experts out there, that still leaves 80% of the user base mostly clueless. Any system that would replace Windows will still have hundreds of millions of users who don't know what they're doing.
OK, so now that all of that's out of the way, let's see what history has to tell us:
* Back in the 1980s, Microsoft sold 3 different major OSes. First they ported Unix to microcomputers (Xenix). Then they came out with DOS for the IBM PC. When it became obvious that PC users didn't want Unix and needed more power than DOS, they came out with OS/2. When it became obvious that OS/2 wasn't going to meet future market needs, they started working on NT. When it came out, Windows 95 was mainly competing against Unix, OS/2, NT, and MacOS. MacOS was the only one of those which wasn't at some point a Microsoft product, yet they could not get users to stop using the damn Win9x product line for 10 years after WinNT was launched!
* Back in the 90s Apple realized that the original Mac architecture that dated back to 1984 was getting to be a real liability and started looking for replacements. Pink, Taligent, Copland, and Rhapsody should all sound familiar to Macophiles. In 1997, Apple bought NeXT, finally bringing out OSX over 4 years later. That's right, it took 4 years to rewrite MacOS on a Unix kernel. So how was Apple able to pull off the miraculous user migration from OS9 to OSX? Well, they had a compatibility layer for older apps, few people wanted to put up with the cooperative multitasking and 50s-era memory management of OS9, their user base had dwindled considerably by 2000, Mac users like buying the latest cool thing from Apple, and Apple simply stopped making hardware that would run the older OS.
All right, so now you are proposing that Microsoft should come out with an OS based on the Linux kernel without a Windows compatibility layer? Well, they already tried that -- it's called Linux! The reason people aren't abandoning Windows in droves is because Linux (or any other OS) doesn't run their Windows apps.
So let's say that they have to put in a Windows compatibility layer so that people with Windows apps will actually use it. Well, they already tried that too -- it's called Windows NT! The reason it took 10 years for MS to migrate users from Win9x to WinNT is that NT couldn't run nearly as many DOS apps (games in particular), it was slower on the resource-restricted hardware of the day, and users didn't like dealing with the security.
So let's say we have this magic combination of Unix and Windows compatibility. MS can't make OEMs stop making hardware compatible with the old Windows, and the old Windows isn't so hideously outdated that people don't want to use it. How, then, would MS make everybody switch?
So let's assume that MS could possibly make an OS so cool that a half-billion users all can't resist upgrading. You now have the same problem of having a half-billion uneducated users. You will still have security problems with viruses, Trojans, etc. Every OS has security problems (the terms "worm" and "rootkit" originated from Unix!), but only Windows has enough users to make the latest virus show up on the evening news. The fundamental problem is that no matter how "secure" you make the OS, there will always be the problem between the keyboard and chair. As I said before, even if you have a hundred million gurus, there's still 80% of the user population who has no problem entering their root password in order to see naked celebrity photos or listen to the latest CD. Hell, a lot of spyware is willingly installed by the user (like Kazaa or Comet Cursor).
I could go on, but the argument is moot. The problems you have with Windows aren't in the kernel. The Windows kernel (ntoskrnl.exe) on my XPsp2 system is barely 2MB. Microsoft could easily just stop shipping the Win32 subsystem with Vista and ship only the POSIX subsystem. It would be great at running Unix apps (if you recompile them, of course), but it wouldn't run Windows apps anymore. At that point it wouldn't really be Windows.
Microsoft could make all the "Next Generation" technologies it wants, but the hard part (the part that takes a long time) is making those technologies compatible with Windows so that people will actually want to use them.
Posted by: Gabe | June 01, 2006 at 07:10 AM
Great comments, Gabe. I actually agree with almost all your points. The compatibility issues, as well as the user adoption issues, are a significant reason why MS feels that they "must" do certain things. MS has to release a new OS; it is actually the "path of least resistance" compared to sticking with an old release. Reversing the trend is important.
The compatibility layer issues are also important. While I advocate that MS "rewrite, don't port", my last post in this series will lighten up on this a bit. My main concern with compatibility layers is that sometimes they provide so much of a crutch that a new OS almost looks crippled. Rosetta is such an example right now. You can buy a new MacBook, which is supposed to be about 6 times faster than the old PowerBook G4. But Photoshop runs 4 times slower on the new hardware!
I believe compatibility is not an absolute. The requirement that "a billion apps need to run as is" is making it harder and harder for MS to move forward. Risks have to be taken that SOME apps will not run, and of course the challenge is to decide which, why, and when that will occur.
Most of your points are about the obstacles to scrapping and replacing, not about the benefits. It is these very obstacles that paralyzed companies like WordPerfect, Ashton-Tate, and Lotus and allowed their products to be replaced. All three felt there was "no compromise" about what they had to do to satisfy their users. All three believed that their existing user base was their most important asset. They all felt they had such dominance that no other product could woo their users away so long as they "kept delivering a compatible upgrade path".
My primary assumption is that Windows will continue to become slower, larger, and harder to deliver so long as the compatibility requirement is imposed; that other products will continue to increase in appeal; and that at some unexpected and surprising point, a competitor will release a product that, for the first time, is a viable alternative, one so dramatically better that people move en masse away from Windows. That has happened repeatedly. It would be the end for Microsoft. And to think that Microsoft is "invincible" is an error that no business can afford to make.
If the Microsoft dream is that on the Starship Enterprise, crew members can still run their old DOS applications from 4 centuries ago, then something is very wrong. Somewhere the line must be drawn.
But, your comment is great. Don't get me wrong, I don't think any of this is easy.
Posted by: Gary Wisniewski | June 01, 2006 at 07:52 AM
I'm still skeptical. Windows is, almost by definition, that which runs Windows applications. The more compatibility that MS leaves out in order to streamline the system, the less "Windows" it becomes.
As far as I can tell, though, backwards compatibility does not make the system measurably slower or larger, just harder to deliver. I think it's safe to say that it's actually new features that make the system larger and slower. If you don't want new features, why upgrade?
Now you may have a good point here, but look closely at history. MS remade Windows as WinNT, released as version 3.1 in 1993. It was late, fairly slow, and had lots of missing features. Version 3.5 came out in 1994, 3.51 in 1995, and 4.0 in 1996, with each version being faster and more featureful than the last. Apple remade MacOS as OS X, released in 2001. It was late, fairly slow, and had lots of missing features. They released 4 more versions over the next 4 years, with each version being faster and more featureful than the last. I think it's a safe bet that any redesign of Windows would be late, slower than the previous version, and be missing many features.
You also make a good point about Lotus, Ashton-Tate, and WordPerfect. If you look at why they failed, you'll see that it was mostly because they botched their only successful product. When that product stagnated (1-2-3 and WordPerfect just made Windows wrappers around their old DOS programs; dBase IV was dead in the water before Windows was even an issue), it was the end. Compatibility wasn't the issue. In fact, Word and Excel succeeded in spite of (and because of) having to be compatible with WordPerfect and 1-2-3. If MS is smart, they are constantly looking back on their former competitors and taking steps to make sure they don't end up in the same position. If you look at Excel, to this very day pressing the slash key activates the menu because that's how Lotus worked.
Posted by: Gabe | June 02, 2006 at 07:01 AM
Hmmm...I just read that ZDNet "article" with the dubious title "Why Windows is less secure than Linux". Did you notice at the very end where he says "This is a blog entry not a news article."?
Also, that's a very bad example for security purposes. For one thing, it doesn't indicate if they're for the same features. If the IIS graph includes an analysis of more features, it should be more complicated. Also, there's nothing to say whether the difference is Windows vs. Linux or IIS vs. Apache. Maybe you'd be just as secure if you ran Apache on Windows, in which case it would be IIS that's insecure rather than Windows.
Of course if you look at the charts at http://blogs.msdn.com/michael_howard/archive/2004/10/15/242966.aspx you'll see that IIS 6 has had exactly 2 security advisories in the past 3.5 years, compared with Apache 2.0.x which had 28. And while that may be a Microsoft blog, it's current data from Secunia.
Posted by: Gabe | June 03, 2006 at 08:49 PM
Gabe:
I do not agree with the compatibility argument at all. A typical user only needs something that has the same UI and feel as Windows. The underlying architecture is not seen by them.
A typical home user would buy Windows, Microsoft Office, some games, and a couple of other smaller applications such as Adobe Reader. The only way this compatibility argument would stand is if the typical user dearly held on to 10 different applications from the 1990s that no longer have any developers to update them. This simply is not the case. If people want to continue running software designed for XP, they can keep running XP! Developers can use the years of crossover to redesign the underlying code of their apps, while keeping the same feel if they wish.
In terms of organisations needing backward compatibility from a cost perspective (the cost of having to redesign everything), while also demanding an improved Windows, well, that is more of an argument. But as the performance of computer hardware rapidly increases, compatibility layers become more practical. Computers these days already give maximum productivity. There is not much more to improve on when you can access a database instantly. So it is not as if there is a need to make the absolute most efficient use of modern hardware advances.
I know of no legitimate reason why Microsoft has to retain backwards compatibility with every new release of Windows.
Having said that, I don't have the knowledge to comment on whether the Windows codebase is irreversibly complex spaghetti code that needs to be scrapped. I am merely pointing out that backwards compatibility is not a reason to keep the existing codebase.
There are some signs of FUD in these articles, such as the notion that Windows is becoming slower over time. These elements leave me a bit skeptical. But on the other side, Microsoft are not leaving me feeling "confident" (one of the Vista marketing words, I think?) at all.
Posted by: BobTurbo | June 28, 2006 at 12:58 AM
Even if the architecture is flawed, and it were redesigned, I wonder what benefit that would have, apart from being "ideal".
I doubt a consumer OS is ever going to be very secure. A company delivers a product that people want. If you look at how people approach the security of their homes, in many countries people don't even bother locking their doors every time they leave the house. All that usually protects a house is a few doors and some glass windows. So really, I think Windows XP pretty much meets the security expectations of most people. The reasons Windows is popular would not become any stronger if the OS were redesigned. If redesigning improves speed in every single area by 3%, that is absolutely nothing, and already wiped out by the next Intel chip that comes out. If redesigning improves security by 20%, the OS is still insecure; 20% means nothing. You would need a huge increase in this area for it to be significant. Taking the consumer OS or the Internet out of the equation would do this much more effectively.
How is Microsoft going to tell shareholders: "Hey, we are going to spend a billion dollars on something that gives no benefit to the consumer, but it will make the diagrams much neater"?
I think what gets lost in all of these technical debates is the purpose of the product to the consumer.
Maybe Microsoft is better off concentrating on delivering value to the consumer, rather than designing the perfect OS that serves no additional purpose.
On the other hand, maybe redesigning the OS could drastically improve security and benefit the end user? I don't know.
Posted by: BobTurbo | June 28, 2006 at 02:16 AM
Anyway, I think Microsoft are going to scrap the codebase eventually and replace it with whatever they are researching. I think they will have a much better system that is pretty much light years ahead of anything currently available, such as Linux, and will be based on all of the lessons they have learned. But that is a long way off, and not extremely urgent. My first reaction to your series of posts was that Vista was a complete waste of time and we deserve something better. But after thinking about it, I would say that refining 2003/XP was the best option at this time.
Maybe in about 6 years we will have a new codebase.
Posted by: BobTurbo | June 29, 2006 at 02:33 AM