We all know that monitoring and alerting are important, right? But just how important do we really think they are? How much do we show that in what we do?
This is a story about an experience I had at a previous company, where I had the pleasure of being at least partially responsible for the system going down, and what we learned.
To set the scene - we had to implement a service that would be accessed any time a user visited the site. I did the work, got it reviewed and tested, and hey presto, got it live. I did some more testing, made sure nothing went bang, and went home.
The next morning, I return to work, and all is well. About 3 hours in, the site goes down. Panic! Time to follow the steps...
For the purposes of this story, the steps themselves are actually not important (although obviously the initial fix was simply to roll back to before my change). What is important is how we got here.
First things first - let's have a look at the monitoring that was in place on the service I was calling:
Everything looks fine! Let's check the whole view:
Oh dear. So, the service was getting warmer and warmer until it reached 100% usage, at which point any subsequent requests failed and the site calling it fell over.
Ok, so the service got overloaded. Let's have a look at the number of calls we were making... oh. There is no monitoring on the number of calls we were making.
Ok, let's assume we did have that monitoring in place - what is the expected number of calls?
Great - now let's have a look at the actual:
I think that would do it, yes. We were overloading the service with around 50x the expected calls (and that's already taking the planned headroom into account).
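To make the "expected versus actual" comparison concrete, here is a minimal sketch of the kind of check we were missing. Everything in it is an assumption for illustration: the `CallBudget` name, the per-minute unit, and the numbers in the usage note are hypothetical, not the figures from the real incident.

```python
from dataclasses import dataclass


@dataclass
class CallBudget:
    """Hypothetical call budget: planned rate plus a safety margin."""

    expected_per_minute: float    # assumed planned call rate
    headroom_factor: float = 2.0  # assumed multiplier for planned headroom

    @property
    def limit(self) -> float:
        # The rate we planned to cope with, headroom included.
        return self.expected_per_minute * self.headroom_factor

    def overload_ratio(self, actual_per_minute: float) -> float:
        # How many times over the planned limit the actual rate is.
        return actual_per_minute / self.limit

    def is_over_budget(self, actual_per_minute: float) -> bool:
        return actual_per_minute > self.limit
```

With a made-up budget of 1,000 calls a minute and 2x headroom, an observed 100,000 calls a minute gives an `overload_ratio` of 50, which is the sort of number you want a dashboard shouting about long before the service melts.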
There's a common misconception in coding that the coder alone is to blame when things go wrong. That's not entirely wrong, but it's certainly not the whole story. Let's have a look at who else was involved in that deployment:
So it's not so simple. Software development is a team effort, so there's never a single person to blame.
To quote the product owner:
Why the hell didn't we know about this earlier, before it became an issue?
It's the only question that really matters in this instance. We had monitoring for the service, but no alerting set up for when it was getting too warm. We had no monitoring in place at all for the number of calls being made, and therefore no alerting.
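As a rough illustration of what "alerting when it was getting too warm" could look like, here is a small sketch that raises a warning once usage stays above a threshold for several samples in a row, well before it reaches 100%. The class name, the 70% and 90% thresholds, and the three-sample window are all hypothetical choices, not what any particular monitoring tool does.

```python
from collections import deque
from typing import Deque, Optional


class UtilisationAlert:
    """Hypothetical alert on sustained high usage (percent, 0-100)."""

    def __init__(self, warn_at: float = 70.0, critical_at: float = 90.0,
                 window: int = 3) -> None:
        self.warn_at = warn_at
        self.critical_at = critical_at
        # Keep only the most recent `window` samples.
        self._samples: Deque[float] = deque(maxlen=window)

    def observe(self, percent: float) -> Optional[str]:
        """Record a sample and return 'warning', 'critical', or None."""
        self._samples.append(percent)
        if len(self._samples) < self._samples.maxlen:
            return None  # not enough history yet to call it sustained
        # Only alert if every sample in the window is above the threshold,
        # so a single spike does not page anyone.
        lowest = min(self._samples)
        if lowest >= self.critical_at:
            return "critical"
        if lowest >= self.warn_at:
            return "warning"
        return None
```

Feeding it a steadily climbing series such as 50, 72, 75, 80, 92, 95, 97 produces a warning from the fourth sample onwards and a critical alert on the last one, so someone hears about the problem while there is still time to act.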
So, some takeaways from my (traumatic) trip down memory lane about how I took down a site: