Release dilemma and the cost of quality

Marwan Darwish
5 min read · Jan 21, 2021

Some time back I was working on a crucial application with a very fast pace of business demand change. The application was an old legacy system, stable enough, but it had thousands of features, many of them surprisingly interrelated, with very little documentation, and hundreds of changes marked in source control by developers no one knew anymore. It was extremely difficult to assess the impact of any change.

As a result, over the course of 15 years, day after day, release after release, every failure added one more counteraction to the release process. Some of those counteractions mitigated extremely rare cases, some addressed risks with very minor impact, but action after action, each taking 10 to 15 minutes, each an extra item on the release checklist, meant that years later a very minor change needed two weeks of work. Our internal customers became extremely furious about this. Sometimes we desperately needed to ship an urgent release, either to comply with a regulation or to fix a crucial bug, but we were unable to do so.

Our first step was to build situational awareness: why does our release cycle take two weeks? This required an assessment with the different teams, covering everything from pushing the code to being ready for deployment up to the happy moment of sending "Release done". When we evaluated the situation, we found we were performing more than 80 steps per deployment, and even excluding developers, it involved Operations, system admins, DBAs, QC, BAs, and a release manager to get all the work done.

Our most interesting finding was that most of the time leaked away in two places. The first was task handover from one work center to another. The second was our very long checklist: most items took less than 15 minutes each, but most of them mitigated either very rare cases or insignificant risks, risks that could be recovered from quickly or that had little impact if they materialized.

This led us to the concept of cost of quality: 100% quality is an illusion, because 100% quality means endless effort and cost. To dig into the concept, take this example. A factory produces 100 pens per day; each pen costs 10 USD to make and sells for 20 USD, so the total cost is 1,000 USD and the total revenue is 2,000 USD, leaving 1,000 USD in profit. For quality reasons, 20% of the products coming off the line are defective, leaving only 80 sellable pens at 20 USD each, for a total of 1,600 USD in revenue and a net of 600 USD. One proposed solution was to buy a plastic-cutting knife for 400 USD, which would raise quality from the original 80% to 90% and recover 200 USD per day. The 400 USD investment would pay for itself in just two working days, which is an awesome return.
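A short back-of-the-envelope script makes the comparison concrete (the numbers come from the example above; the function and variable names are my own):

```python
# Cost-of-quality calculation for the pen factory example above.
PENS_PER_DAY = 100
UNIT_COST = 10    # USD to produce one pen
UNIT_PRICE = 20   # USD selling price per pen

def daily_profit(quality_rate):
    """Daily profit when only `quality_rate` of production is sellable."""
    sellable = PENS_PER_DAY * quality_rate
    return sellable * UNIT_PRICE - PENS_PER_DAY * UNIT_COST

baseline = daily_profit(0.80)           # 600 USD at 80% quality
with_knife = daily_profit(0.90)         # 800 USD at 90% quality
daily_gain = with_knife - baseline      # 200 USD recovered per day
payback_days = 400 / daily_gain         # the 400 USD knife pays off in 2 days
```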

Another proposed solution was to add one employee to fix defects at an early stage, which could raise quality from 90% to 93% for only 50 USD per day. That recovers 3 more pens, an extra 60 USD in revenue, for a net gain of just 10 USD over the wage; and if the improvement recovered only 50 USD, the net profit would be a big fat ZERO. The point is to weigh the return on investment against the cost of quality. In more extreme cases, you could buy a laser-cutting machine for 100K USD to save 400 USD per day, a payback period of roughly 250 working days.

This was one of the big insights from the DevOps and Agile communities: take calculated risks. It is OK to have a planned or recoverable failure. If a specific issue has an occurrence rate of 5% and can be recovered from in 15 minutes, that means 15 minutes of recovery once every 20 releases, while the mitigation action costs 15 minutes on every single release. It is more cost-efficient to simply accept the risk and save those 20 × 15 resource minutes.

Back to our story: is it really worth spending over 50 hours of work every week to mitigate rare, possible issues?

The Project Management Institute (PMI), in its famous PMBOK guide (Project Management Body of Knowledge), states that risk exposure = risk probability × risk impact. Whenever you evaluate a risk mitigation plan, you should compare that value to the cost of the mitigation. If the mitigation costs more than the exposure, you should probably look for a different plan, and either transfer the risk (as with insurance plans) or accept it and let it go.

For example, suppose you are camping with equipment that may be spoiled by rain, and one mitigation plan is to buy a fabric tent. If the equipment is worth 100 USD and the tent costs 100 USD, buying the tent makes no sense. Now suppose the equipment is worth 1,000 USD, the tent costs 100 USD, and the probability of rain is 10%. The total risk exposure is 0.1 × 1,000 USD = 100 USD, exactly the tent's price, so the investment is still not worth it. But if the tent cost 20 USD, it would definitely be a good investment.
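The same rule can be written as a tiny helper (a sketch; the function names are mine, not PMBOK's):

```python
def risk_exposure(probability, impact):
    """PMBOK-style risk exposure: probability times impact."""
    return probability * impact

def worth_mitigating(probability, impact, mitigation_cost):
    """A mitigation pays off only if it costs less than the exposure."""
    return mitigation_cost < risk_exposure(probability, impact)

# Camping example from above: 1,000 USD equipment, 10% chance of rain.
print(risk_exposure(0.10, 1000))          # 100.0 USD
print(worth_mitigating(0.10, 1000, 100))  # False: the 100 USD tent is break-even
print(worth_mitigating(0.10, 1000, 20))   # True: the 20 USD tent is a clear win
```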

So, back to calculated risks and our exercise. Our decision was to identify the genuinely significant risks, those with either high probability or high impact. All others would be accepted and counteracted only if they actually occurred.

This cut the list of needed actions from more than 80 to fewer than 30 action items.

The second big action was to reduce handover time by assigning all tasks to an expeditor, who made sure every task finished at the appropriate time.

This alone cut the release cycle to less than three days, without any technical intervention, no release, build, or test automation involved. And for more than four years since then, we can hardly remember any significant failure or outage caused by this change.

The moral of the story: sometimes all you need is situational awareness and a little tweaking, before reaching for any major, high-impact actions.

What we learnt from the Agile community is to take calculated risks.


Originally published at http://demystifyprogramming.wordpress.com on January 21, 2021.
