How to reduce testing time and still improve quality

This is the conversation I've had the most frequently since joining Microsoft a few months ago.  The sides of the debate are framed like this:

People with experience in the world of online services will say that release time frames need to be short to remain competitive.  They'll cite agile methods with lightweight planning, and generally very light test plans.  People who have come from the packaged software (Windows/Office) side say that short cycles cut corners on quality, and that longer releases let you get more work done at higher quality.

Today I'll focus on the quality angle.  I argue that rapidly released software can be higher quality with a lower QA cost.  You can have your cake and eat it too.  To think through this, we need a measure of software quality.  That's an elusive problem, but for my purposes, we don't actually need to be perfect.  I propose:

  • Software Bugginess = Sum of cost of all bugs (that seems obvious)
  • Cost of a bug = Severity of bug * number of user sessions encountering it.
  • Severity is loosely defined as the amount of pain a bug causes to a user (and the part that is hard to quantify).
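
The measure above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the `Bug` class, the bug names, and every number below are made up, and severity is in arbitrary "pain" units.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    name: str
    severity: float         # pain per affected session, in arbitrary units
    sessions_affected: int  # number of user sessions that hit the bug


def bug_cost(bug: Bug) -> float:
    # Cost of a bug = severity * number of user sessions encountering it
    return bug.severity * bug.sessions_affected


def bugginess(bugs: list[Bug]) -> float:
    # Software bugginess = sum of the cost of all bugs
    return sum(bug_cost(b) for b in bugs)


# Hypothetical numbers, for illustration only.
bugs = [
    Bug("data loss on save", severity=10.0, sessions_affected=200),
    Bug("misaligned button", severity=0.5, sessions_affected=40_000),
]

for b in bugs:
    print(f"{b.name}: {bug_cost(b):.0f}")
print(f"total bugginess: {bugginess(bugs):.0f}")
```

With these made-up inputs, the low severity bug that many sessions hit costs more than the severe bug that few sessions hit, which is exactly what the second term of the formula captures.
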
Now in the packaged software world, once the software is burned onto DVDs or pre-loaded on new computers for purchase, it's effectively frozen.  Oh yes, there are patches that get released, but those come with a fairly intrusive cost to users and are never 100% applied to all your customers.  So as a software development manager, to reduce the bugginess of your software you have to focus on getting rid of all high severity bugs.

Now I have no science to back this up, but I'm pretty sure most QA processes lead to a graph similar to the one at left.  Most of the bugs are ferreted out in the early phase of testing, but getting rid of all high severity bugs yields a decreasing return on investment.  So packaged software development managers must insist on a long test cycle to ensure high quality in their software.

Online services, however, have the beautiful property that you can update the software for all of your users by simply updating the server.  While that isn't always a trivial process, it is completely in the software development manager's control.  So now, instead of focusing on getting rid of every last high severity bug, we can focus on the other side of that equation: reducing the number of user sessions experiencing it.  That is a huge win because we can do just the first part of the QA curve, where we find most of the bugs, without paying the long price of getting near 100% of the bugs out.  Further, there are lots of techniques for reducing the number of user sessions experiencing a bug, from rolling out the release slowly to a subset of servers, to doing daily bugfix releases to fix the bug you created yesterday.
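
As a rough sketch of the "subset of servers" idea, here is a toy Python model of how few sessions see a new release when only some servers run it. Everything in it is an assumption for illustration: the function names, the hash-based assignment of sessions to servers, and the server counts. A real load balancer will route traffic its own way.

```python
import hashlib


def server_for(session_id: str, num_servers: int) -> int:
    # Assumption: sessions are spread evenly over servers by hashing the id.
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers


def sessions_on_new_release(session_ids, num_servers, upgraded_servers):
    # Servers 0 .. upgraded_servers - 1 run the new release; the rest run
    # the old one. Count the sessions that land on an upgraded server.
    return sum(server_for(s, num_servers) < upgraded_servers for s in session_ids)


# Hypothetical traffic: 10,000 sessions over 20 servers.
sessions = [f"session-{i}" for i in range(10_000)]
for upgraded in (1, 2, 10, 20):
    exposed = sessions_on_new_release(sessions, 20, upgraded)
    print(f"{upgraded:2d} of 20 servers upgraded: {exposed} sessions exposed")
```

If a bug slips through while only a couple of servers are upgraded, the "number of user sessions encountering it" term stays small, so the cost of the bug stays small even when its severity is high.
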

But here's where most people screw it up.  People hear about agile and moving fast, and decide that they too must move fast to be competitive.  And they get excited about lowering their quality bar since it's easy to fix the bugs the next day.  And they hire only developers since all small companies should require developers to be testers (a subject of a future post).  And the developers rush out code they've barely tested at 7pm on Friday.  Then they go home, and are startled to get the SMS at 4am that their new code has broken.  So they resolve to fix it, and test it more carefully, and getting the fix and the test out takes until next Wednesday.
I see lots of places that lower the quality bar but never take the next step of fixing problems quickly, and so they generally release crap.  And thus was born the idea that agile development processes tend to be low quality.

But it doesn't have to be this way.  To release often and have a lower QA bar to release you must do several things:
  1. Have a simple and automated release process.  You plan to release often, but if it's cumbersome or manual, you will fail at this and spend too many resources doing releases.
  2. Have a simple and automated rollback process.  You're making a decision to roll out software that hasn't been rigorously tested.  You're definitely going to make mistakes and you'll need to fix them quickly.  Making sure that rollback works should be a part of your release to the staging environment before you release to production.
  3. Have a segment of your production site that you can release new software to without having to release to everyone.
  4. Have a source control environment that 100% replicates your production code, and is separate from your bleeding edge development environment.
  5. Plan to release regularly, and insist on fixing bugs the same day or next day they're discovered.
  6. Measure the number of days in production you've had for your various severities of bugs.  If you don't measure it, you will never fix it.
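
For the last item, a minimal sketch of the measurement might look like the following. The log format (severity, date shipped, date fixed), the severity labels, and the dates are all hypothetical. In practice the records would come from whatever bug tracker and release log you already have.

```python
from collections import defaultdict
from datetime import date


def days_in_production(bugs):
    # Sum, per severity, the days each bug was live in production.
    totals = defaultdict(int)
    for severity, shipped, fixed in bugs:
        totals[severity] += (fixed - shipped).days
    return dict(totals)


# Hypothetical log entries: (severity, date shipped, date fixed).
bug_log = [
    ("high", date(2024, 3, 1), date(2024, 3, 2)),
    ("high", date(2024, 3, 8), date(2024, 3, 13)),
    ("low", date(2024, 3, 1), date(2024, 3, 15)),
]

print(days_in_production(bug_log))
```

Tracking this total per severity over time shows whether the same day or next day fix policy is actually being met.
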
Doing these things takes a culture and a mentality, but they're the basic, 101-level fundamentals of running a fast moving development organization.  By doing them, and focusing on the number of days in production for severe bugs, you can actually achieve higher quality systems than packaged software that has undergone a 6 month testing cycle.
