A couple of weeks ago we had a short service outage for PractiTest.
The service was down for about 22 minutes. Outside of scheduled maintenance, this was the first time in over 18 months that our service was unavailable for more than a couple of minutes (and even those brief interruptions happened only twice).
Even though short outages like this one are common in our industry (after all, there is no system, not even Gmail, that doesn’t have glitches once in a while), we went through a serious retrospective analysis of what happened. Our goal is to avoid similar issues in the future and, maybe more importantly, to respond even faster if something like this happens again.
What went right
Part of our analysis showed that there were many things that worked correctly.
- We got both SMS messages and notification phone calls from our automatic monitoring systems telling us something was wrong with our servers.
- All back-up systems were working correctly (even though we did not really need them because no data was corrupted at any time).
- Our team was aware of the issue even before the first of our users contacted us.
Things to improve
We also detected a couple of things that need to be improved:
1. Because of system security procedures, there were only two PractiTest employees who could respond and act when issues like this happened. Unfortunately, this turned out not to be enough: at the exact time the issue happened, one of them was commuting and the other was also out of the office, with a dead smartphone battery.
To avoid issues like this, we have given another employee access to these servers. We are also creating an internal notification process to make sure that at least one of them is available 24/7, with more than one way to reach them.
2. Up to last week our internal monitoring system didn’t cover secondary services, and one of those services turned out to be the culprit. From now on we’ll monitor all services, primary and secondary. The monitoring process will also give us a heads-up before a service reaches a dangerous level, so that we have more time to act.
3. One of the things we had already planned to do, but which may take a couple of sprints to have in place, is to use the Auto Scaling system provided by Amazon. This will allow our system to scale up automatically whenever the CPU usage of any of our services goes over a set threshold.
We had already started working on this, but we have now raised its development priority.
4. Last but not least, we want to provide even better visibility and transparency into what is happening in our service. We know that we have the best support and provide almost immediate answers via any of our current communication channels (e.g. Support Site, Skype, email, etc.).
But we need to improve the way we broadcast information, by being quicker with our Twitter updates and by publishing blog posts such as this one sooner after an incident.
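To illustrate the idea behind points 2 and 3, here is a minimal sketch in Python of a two-level check: a warning level that gives a heads-up early, and a critical level that marks the dangerous zone. This is not our actual monitoring or scaling configuration. The service names, the CPU readings and both threshold values are hypothetical and were chosen only for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
WARNING_CPU = 70.0   # percent; early heads-up level
CRITICAL_CPU = 90.0  # percent; "dangerous" level


@dataclass
class ServiceReading:
    name: str           # service name (hypothetical)
    cpu_percent: float  # latest CPU usage sample, in percent


def classify(reading: ServiceReading) -> str:
    """Return 'ok', 'warning' or 'critical' for one reading."""
    if reading.cpu_percent >= CRITICAL_CPU:
        return "critical"
    if reading.cpu_percent >= WARNING_CPU:
        return "warning"
    return "ok"


def services_needing_attention(readings):
    """List (name, level) for every service that is not 'ok'."""
    return [(r.name, classify(r)) for r in readings if classify(r) != "ok"]


# Primary and secondary services are checked in exactly the same way.
readings = [
    ServiceReading("web", 45.0),
    ServiceReading("search-indexer", 78.5),
    ServiceReading("mail-queue", 93.0),
]
alerts = services_needing_attention(readings)
for name, level in alerts:
    print(f"{name}: {level}")
```

The point of the separate warning level is time: a service that crosses it can be looked at (or scaled up) before it ever reaches the critical one.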
(*) Just a short note to say that some of our technological plans may change, as we have been accepted into the Red Hat Innovate program.
As always, we are here to answer any questions you may have about this or any other aspect of our service.
Please don’t hesitate to contact us via firstname.lastname@example.org.
The PractiTest team.