[Microsoft's New Win-Win Strategy: Post 2 of 5]
As a successful software architect, I've learned to recognize the results of a poorly managed design process. In Windows Vista: Past Its Due Date Already, I gave some insight into why Windows matches a "software implosion pattern" I've seen before. This post explores why Microsoft really should scrap the codebase; the next post will suggest the controversial idea that Microsoft should scrap Windows in favor of Linux.
That the codebase is in trouble is no secret, and for the most part even Microsoft agrees. Microsoft has a team called "The Windows Code Excellence Team". They have a program for "driving broad changes efficiently into the Windows codebase". They call these "Strike Force Efforts", and the very name they chose reveals the adversarial relationship that even Microsoft insiders perceive with their own codebase.
But still, Microsoft thinks they can fix it. This post is about why they can't.
The Recipe Was Wrong From The Start
The biggest single reason why Microsoft can't fix the problem is that the Windows architecture is flawed at its very foundation. They're sinking money into repairs on an old car, trying to make it fuel efficient, trying to make it conform to current needs. But as we all know, sometimes you need to buy a new one.
Understanding why this is true is technically challenging for most people. The real issues are understood only by well-educated computer scientists. As a result, most of the common discussions around Windows architecture talk about the visible aspects of Windows, rather than the technical underpinnings. And, because the technical world is so full of geeks who argue pointlessly (a la SlashDot), there are rarely any efforts to bring clarity to Windows' most severe problem.
There are ample sources of technical information about this. While it may seem like a Windows vs. Mac argument, Daniel Eran's excellent (if lengthy) article "Five Architectural Flaws in Windows Solved in Mac OS X" is one of the best examples. What can easily be missed about this article is that four of the five points Dan makes follow directly from Apple's decision to use Unix as the architectural basis for OS X. If Dan were talking about Mac OS 9, only the first of his five arguments would apply.
So, by adopting Unix, Apple was able to push their own technology into the next generation.
Both anecdotal evidence and qualified research are available to illustrate the problem. Security researchers can point squarely at flawed API design as a primary reason why Windows architecture actually "encourages insecure applications". Those who remember Frederick Brooks's classic book "The Mythical Man-Month" see the same trip-ups in Vista that Brooks warned about in his book.
Because the Windows architecture has fundamental design flaws, Microsoft is constantly adding layer after layer of new technologies to achieve what would be implicit in a better architecture. This increases the amount of work, technology, and conceptual baggage that developers need to learn and master. Thus, it decreases the overall reliability of applications. Developers have a limited amount of mental energy. Any Windows developer knows that negotiating Windows "idiosyncrasies" takes up almost as much time as the application itself.
The idiosyncrasies themselves become entire "worlds of new technology". Knowing how the Windows event loop operates, or mastering arcane topics such as using "assemblies" to avoid "DLL Hell", becomes a distinct new art form. The Windows community has become so accustomed to this continual "band aid" approach that they don't even recognize the original problem. Instead, the new technologies become job skills, and further serve to separate Windows programming expertise and culture from the larger world of shared computer science knowledge. (See an earlier post of mine for more on this "disconnect").
Now and then, some Windows developer asks the right questions (from the InfoWorld gripeline):
I am surprised that still, after all these years, that Windows has not seen the solution that UNIX (and probably many OS's) takes to DLL Hell -- use versioned DLL files so something linked against an old DLL will use the old one while something linked against the new one will use the new one. Viola. Problem solved.
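The versioning scheme the reader describes can be sketched as a toy model. This is only an illustration of the idea, not how any real loader is implemented: the library names, version numbers, application names, and the `resolve` function are all hypothetical.

```python
# Toy model of versioned shared libraries, in the spirit of Unix sonames.
# Every name and version number below is invented for illustration.

# Installed library files, keyed by the versioned name programs link against.
INSTALLED = {
    "libwidget.so.1": "libwidget.so.1.4.2",  # old major version, kept for old apps
    "libwidget.so.2": "libwidget.so.2.0.1",  # new, incompatible major version
}

# Each application records the versioned name it was linked against.
LINKED_AGAINST = {
    "legacy_app": "libwidget.so.1",
    "new_app": "libwidget.so.2",
}


def resolve(app: str) -> str:
    """Return the library file this model would load for the application."""
    versioned_name = LINKED_AGAINST[app]
    if versioned_name not in INSTALLED:
        raise LookupError(f"{app} needs {versioned_name}, which is not installed")
    return INSTALLED[versioned_name]


if __name__ == "__main__":
    for app in LINKED_AGAINST:
        print(app, "->", resolve(app))
```

Because both major versions stay installed side by side, upgrading the library for `new_app` does not change what `legacy_app` loads.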
Continual layers of technology to solve architectural problems lead to the next problem with the Windows codebase ...
In a recent New York Times article, the following appeared:
Several thousand engineers have labored to build and test Windows Vista, a sprawling, complex software construction project with 50 million lines of code, or more than 40 percent larger than Windows XP.
Windows is growing beyond even Microsoft's ability to manage it. In October 2004, Martin Taylor, then Microsoft GM of Platform Strategy, admitted that changes were needed and introduced a new "role-based" strategy for reducing the size of the code-base:
"Today, it's still the entire code base. There's no reduction in the bits you get; things are just roped off," Taylor said Friday. "We want to get to a model of role-based deployment where you might just have the bits you need for that function. ... It's one of our design goals for Longhorn."
In that ComputerWorld article, Taylor was talking about a new agenda for Longhorn to trim the size of the codebase. The article later says:
A Microsoft spokeswoman confirmed that the goals of providing a smaller Windows "footprint" are to cut maintenance costs and provide a "reduced surface attack area."
Martin is now part of the Windows Live team. The new GM of platform strategy, Bill Hilf, hasn't said a word about it in the last 18 months. So much for trying. If a smaller code-base was one of the design goals for Longhorn, it seems Microsoft has decided to put the idea (and Martin Taylor) on the back burner for now.
Windows 3.1 had 2.5 million lines of code, Windows 95 had 15 million, XP has 40 million, and Vista will have at least 50 million. Microsoft managers are stumped about how to reduce the complexity and size, and Michael Cherry, former Microsoft product manager, says "It's such a collection of smart people that they've started to believe too much in themselves". (MercuryNews)
Not only is the problem growing, but the team is looking the other way.
Reinventing Microsoft Software Culture
It's not enough to throw out the codebase. Here's the real challenge: You need to retool the very culture which allowed it to happen. Before starting over, you need to figure out what (mis-)management style allowed Windows XP's excellent security foundation to be completely disabled by other arms of the organization. Microsoft already tried to re-invent Windows once with Windows NT. The problem is, while one set of OS experts in Microsoft is devising an excellent security framework, another set of "experts" is violating all the rules in the interest of "dumbing down the features" for users. Security guru Steve Gibson (quoted in WindowsITPro) explains the phenomenon:
"With Windows 2000 you could argue that Microsoft was at least preserving the original NT security model," Gibson noted. "Regular users would log on as Administrator only when doing system tasks like installing applications or bug fixes, and then log on as a regular user to get work done. This is much like a UNIX machine, where the root account is tweaked very carefully, not generally used for day-to-day work. But Microsoft moved that NT security model to the home and gave Administrator power to users. [The company] discarded the traditional security model because it was too hard to explain to users..."
Aaron Margosis (a Microsoft employee) has an entire blog dedicated to the topic of trying to help people run Windows as "Non-Admin". Despite his excellent advice, it is impossible for the average user to comprehend, much less follow, such instructions when all of the default settings of Windows, and the expectations of third-party software, are that average users will be running as Administrator. As a result, Microsoft's excellent security model lies in the background, gathering dust, while clever hackers throughout the world know that it's open season for attacking the average user's desktop. None of this is an accident. It is engineered into the product, introduced by well-meaning product managers attempting to make things simpler while people elsewhere in the organization know better.
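To make the "non-admin" habit concrete, here is a minimal sketch of a program that checks whether it has been started with superuser rights and warns if so. It assumes a Unix-like system, where the effective user ID of root is 0; the function names are my own, and on platforms where Python has no `os.geteuid` the check simply reports nothing.

```python
import os
from typing import Optional


def is_superuser(euid: int) -> bool:
    """On Unix-like systems, effective user ID 0 is root."""
    return euid == 0


def current_euid() -> Optional[int]:
    """Return the effective user ID, or None where the platform lacks geteuid."""
    geteuid = getattr(os, "geteuid", None)
    return geteuid() if geteuid is not None else None


def warn_if_privileged() -> bool:
    """Print a warning and return True when running as root, else return False."""
    euid = current_euid()
    if euid is not None and is_superuser(euid):
        print("Warning: running as root; use a regular account for daily work.")
        return True
    return False


if __name__ == "__main__":
    warn_if_privileged()
```

A check like this does not enforce anything. It only reflects the convention Gibson describes: privileged accounts are for system tasks, and everything else runs as a regular user.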
It is as if Microsoft is releasing products by "trial and error". They even recognize the "non-admin" problem and are moving to add yet another layer of complexity called User Account Control to Vista to help solve this problem. But, even on the second attempt, it's not looking good. Beta users are annoyed and claiming it is far too complex and intrusive:
In its current incarnation, too many people are likely to dismiss [User Account Control] completely, and if that happens, everyone loses. [Ed Bott -ZDNet]
That Microsoft has launched a major product with a major, self-introduced security flaw is the most brazen sort of incompetence. That they are still not getting it right reveals something much worse: ignorance. While scrapping the codebase is essential, it's equally essential to establish new rules when moving forward.
The final paragraph of one of Margosis's articles is among the most interesting. Probably the biggest reason why Windows XP is so vulnerable, and why so many people run as root, is cultural:
Hey, y’all! We need to lead by example. People look to us for best practices, for the right way to do things. We are trying to convince the world that we are thought leaders in software and in software security. In the Unix world, they never run as root except when necessary. They “su”, do what they need to do, and revert back. We are not leaders when we run as root all the time. Comrades: you need to run as “User”, and your customers need to see you doing it. If you run into issues, don’t add yourself back to the admins group – file a bug against the offending product. Customers: if you see any MS sales, MCS, Premier, PSS, etc., doing web or email as admin, please tell them, “You’re not setting a very good example. I am disappointed.”
While some experts agree that the above flaws are proven facts, I suspect there are more people in the industry that consider these things quite subjective. As we all know, a picture is worth a thousand words. In April, a ZDNet article with the dubious title "Why Windows is less secure than Linux" included two diagrams generated by Sana Security and shown below. These diagrams received little attention, but compare graphically how Windows and Linux process the service of a single web page.
The first diagram illustrates how Windows processes the page:
The second illustrates how Linux processes the page:
The orderly arrangement of the Linux traces is no accident. It is the result of years of thinking which goes back to the origins of Unix itself. Good operating system design assures that the operating system and application layers are distinct, separated by boundaries which are like immovable walls. Such walls manage the complexity of systems by isolating operations from one another to minimize dependencies. To a good system designer, these are not just guidelines, they are dogma.
To any good systems architect, the traces of the Windows diagram are like a giant black spot on your MRI. They represent an undisciplined and haphazard set of interrelationships resulting from years of unsystematic development and support of legacy code and processes.
Little Hope of Repair
There is hardly any hope of repair for such systems. The new Windows Vista may eventually work, but it will be by brute force testing and a bit of sleight of hand---not good design. Microsoft, however, is trying to fix it. On his blog, Microsoft employee Larry Osterman describes the problem:
As systems get older, and as features get added, systems grow more complex. The operating system (or database, or whatever) that started out as a 100,000 line of code paragon of elegant design slowly turns into fifty million lines of code that have a distinct resemblance to a really big plate of spaghetti.
This isn't something specific to Windows, or Microsoft, it's a fundamental principal of software engineering. The only way to avoid it is extreme diligence - you have to be 100% committed to ensuring that your architecture remains pure forever.
It's no secret that regardless of how architecturally pure the Windows codebase was originally, over time, lots of spaghetti-like issues have crept into the product over time.
Larry's right about the problem. But unfortunately there is no way to turn back the hands of time and retroactively make sure that the architecture was pure from the start. Yet, he goes on to describe how Microsoft has developed internal tools "that perform static analysis of the windows binaries and they work out the architectural and engineering dependencies between various system components". The hope is that by knowing which layers should be isolated and why, changes can be put in place which fix the problem and eliminate the spaghetti.
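The basic idea behind such a layering check can be sketched in a few lines. This is not Microsoft's tool, which works on real binaries and is not described in any detail here. It is only a minimal illustration, with hypothetical component names and layer numbers, of comparing a list of dependencies against a layer assignment and flagging any lower layer that reaches upward.

```python
# Hypothetical components and layers; lower numbers are closer to the kernel.
LAYER = {
    "kernel": 0,
    "drivers": 1,
    "system_services": 2,
    "shell": 3,
    "browser": 4,
}

# (caller, callee) pairs, read as "caller depends on callee".
DEPENDENCIES = [
    ("shell", "system_services"),
    ("browser", "shell"),
    ("system_services", "kernel"),
    ("system_services", "browser"),  # a lower layer reaching up: a violation
    ("drivers", "kernel"),
]


def layering_violations(layer, dependencies):
    """Return the dependencies in which a component relies on a higher layer."""
    return [(caller, callee) for caller, callee in dependencies
            if layer[caller] < layer[callee]]


if __name__ == "__main__":
    for caller, callee in layering_violations(LAYER, DEPENDENCIES):
        print(f"violation: {caller} (layer {LAYER[caller]}) "
              f"depends on {callee} (layer {LAYER[callee]})")
```

Finding the violations is the easy part. Each one flagged still has to be untangled by hand in the code that created it.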
Then, unfortunately, Larry goes into the tall grass when he says:
Well, most of the layering issues can be resolved via email, but for some set of issues, email just doesn't work - you need to get in front of people with a whiteboard so you can draw pretty pictures and explain what's going on.
Software architecture may be an interesting thing to talk about in email or on the whiteboard. But such naive attempts will not make the sweeping architectural changes that are necessary to yield noticeable improvement. Only good design, enforced by software tools and disciplined coding practices, can result in well-layered systems with manageable complexity. Much of the Windows code itself predates the very tools and practices needed to fix it. For the fundamental design of Windows to change, you need to go back to the drawing board.
Even if you believed, for a moment, that you could check every line of code and fix every dependency, the math would get you. As the number of function points increases, the number of side-effects and dependencies increases exponentially. Even a small software system can have millions of interdependent relationships. A large system like Vista with 50 million lines of code would have side-effects and causal relationships that would defy analysis.
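A rough calculation shows the scale involved. The sketch below rests on simplifying assumptions of mine: it counts only how many pairs of components could interact, and how many distinct groups of components exist, for a given number of components. The component counts are illustrative, not measurements of Windows. Strictly speaking, the pair count grows quadratically; the growth becomes exponential once combinations of any size are counted.

```python
from math import comb


def pairwise_interactions(n: int) -> int:
    """Number of distinct pairs among n components: n * (n - 1) / 2."""
    return comb(n, 2)


def possible_subsets(n: int) -> int:
    """Number of subsets of n components (including the empty set): 2 ** n."""
    return 2 ** n


if __name__ == "__main__":
    for n in (10, 100, 3000):
        pairs = pairwise_interactions(n)
        digits = len(str(possible_subsets(n)))
        print(f"{n} components: {pairs} possible pairs, "
              f"subset count has {digits} digits")
```

With 3,000 components there are already 4,498,500 possible pairs, which is how even a small system reaches millions of potential relationships.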
In March, when Vista slipped, article after article appeared about the whys and whens of the slip. The popular journalism moved quickly into an editorial stance. The New York Times article "Burden of the years weighs on Windows" set the stage:
"Windows is now so big and onerous because of the size of its code base, the size of its ecosystem and its insistence on compatibility with the legacy hardware and software, that it just slows everything down," noted David Yoffie, a professor at Harvard Business School. "That's why a company like Apple has such an easier time of innovation."
Microsoft was uncharacteristically silent. I believe, finally, there could be no disagreement.
Whether or not the problem is as egregious as I say, there is certainly a belief that Windows has reached a turning point in its lifecycle. Consumer confidence in Windows' behavior has waned, and the recognition that the underlying operating system is to blame is becoming widely accepted.
If, as part of a bold new strategy, Microsoft announced that the Vista codebase was the end-of-the-line, confidence in future solutions could finally increase. Instead of fighting the past, the talented teams of Microsoft engineers would be learning from their mistakes. As it is, there is far too much code and far too many problems for them to do anything other than trudge forward, making it work as best they can.
The Windows codebase is in bad shape. It's unlikely that Microsoft, or indeed anyone, can fix it. I am certain they will create a usable version of Vista. But, I expect that one year after its release, we will not be looking back, happy that the problems are solved. Instead, such an albatross of design can only yield new problems, and new challenges for Microsoft.
Get rid of it, and replace it. But with what? In the next post (coming in a couple days), I'll suggest that Linux be a part of a new strategy to revitalize the product line. It's good for Microsoft in more ways than you think.