Conclusion of Outage between 06:19 and 06:41 on 14/05/2020
Firstly, apologies for the outage this morning and I hope the short downtime did not affect you too much.
At 06:19 one of our physical ESX hosts failed, which meant that all the virtual machines on that host were automatically restarted on another host. Our monitoring system alerted the on-call engineer to the fault and an investigation was started. While we would normally expect a host failure to cause only a minute or two of downtime for the affected machines, in this particular case it lasted longer.
Several of the restarted machines failed to reconnect to the network correctly and, although they were running normally at the OS level, they were not able to pass any network traffic. This fault was quickly identified and fixed, with normal network traffic re-established to all client sites and services by 06:41.
The affected machines had multiple network cards, which had been marked as detached during the host failure. The host failure itself was due to the VMware ESX operating system crashing. Post-crash investigation has revealed that the crash was caused by a bug in a part of the ESX software that we do not use, but which can apparently still cause this issue.
https://kb.vmware.com/s/article/71361
Over the next week we will be applying emergency patches to our ESX hosts to patch this part of the software that, ironically, we don't actually use. No downtime will be required for customers in order to patch the hosts, as virtual machines will be live migrated from one host to another, as happens several times a day due to load variations.
Apologies once again for the outage. We will now be marking this incident as resolved.
Regards,
Paul Orrock
Technical Director
Digital Craftsmen Ltd.