Host failure in Enfield datacentre

Incident Report for Digital Craftsmen

Resolved

Conclusion of Outage between 06:19 and 06:41 on 14/05/2020

Firstly, apologies for the outage this morning. I hope the short downtime did not affect you too much.

At 06:19 one of our physical ESX hosts failed, which meant that all the virtual machines on that host were automatically restarted on another host. Our monitoring system alerted the on-call engineer to the fault and an investigation was started. While we would normally expect a host failure to cause only a minute or two of downtime for the affected machines, in this particular case it lasted longer.

Several of the rebooted machines failed to reconnect to the network correctly and, although running normally at the OS level, were not able to pass any network traffic. This fault was quickly identified and fixed, with normal network traffic re-established to all client sites and services by 06:41.

The affected machines had multiple network cards, which had been marked as detached during the host failure. The host failure itself was due to the VMware ESX operating system crashing. Post-crash investigation has revealed that the crash was due to a bug in a part of the ESX software that we are not using but which can apparently still cause this issue: https://kb.vmware.com/s/article/71361

Over the next week we will be applying emergency patches to our ESX hosts to fix this part of the software that, ironically, we don't actually use. No customer downtime will be required to patch the hosts, as virtual machines will be live migrated from one host to another, as already happens several times a day due to load variations.

Apologies once again for the outage. We will now be marking this incident as resolved.

Regards,

Paul Orrock
Technical Director
Digital Craftsmen Ltd.
Posted May 14, 2020 - 18:18 BST

Monitoring

All services have now been fully restored.

We continue to monitor the situation whilst we investigate the root cause of the failure.

Simon Wilcox
Posted May 14, 2020 - 07:16 BST

Identified

We have experienced a hardware failure in our Enfield datacentre.

Services are currently automatically restarting on alternative hosts.

Some client sites are affected and may be experiencing downtime whilst this happens.

We apologise for the inconvenience.

Simon Wilcox
Posted May 14, 2020 - 06:49 BST
This incident affected: Locations (Enfield Datacentre) and Infrastructure (Compute Cloud).