It is good that cloud providers test for disaster, but it’s bad when it causes downtime.
RCA – Storage Related Incident – North Europe
Summary of impact: Between 13:27 and 20:15 UTC on 29 Sep 2017, a subset of customers in North Europe may have experienced difficulties connecting to or managing resources hosted in this region due to availability loss of a storage scale unit. Services that depend on the impacted storage resources in this region that may have seen impact are Virtual Machines, Cloud Services, Azure Backup, App Services\Web Apps, Azure Cache, Azure Monitor, Azure Functions, Time Series Insights, Stream Analytics, HDInsight, Data Factory and Azure Scheduler, Azure Site Recovery.
Customer impact: A portion of storage resources were unavailable resulting in dependent Virtual Machines shutting down to ensure data durability. Some Azure Backup vaults were not available for the duration resulting in backup and restore operation failures. Azure Site Recovery may not be able to failover to latest recovery points or replicate VMs. HDInsight, Azure Scheduler and Functions may have experienced service management and job failure where resources were dependent on the impacted storage scale unit. Azure Monitor and Data Factory may have seen latency and errors in pipelines that have dependencies in this scale unit. Azure Stream Analytics jobs stopped processing input and/or producing output for several minutes. Azure Media Services saw failures & latency for streaming requests, uploads, and encoding.
Workaround: Implementation of Virtual Machines in Availability Sets with Managed Disks would have provided resiliency against significant service impact for VM based workloads.
Root cause and mitigation: During a routine periodic fire suppression system maintenance, an unexpected release of inert fire suppression agent occurred. When suppression was triggered, it initiated the automatic shutdown of Air Handler Units (AHU) as designed for containment and safety. While conditions in the data center were being reaffirmed and AHUs were being restarted, the ambient temperature in isolated areas of the impacted suppression zone rose above normal operational parameters. Some systems in the impacted zone performed auto shutdowns or reboots triggered by internal thermal health monitoring to prevent overheating of those systems. The triggering of inert fire suppression was immediately known, and in the following 35 minutes, all AHUs were recovered and ambient temperatures had returned to normal operational levels. Facility power was not impacted during the event. All systems have been restored to full operational conditions and further system maintenance has been suspended pending investigation of the unexpected agent release. Due to the nature of the above event and variance in thermal conditions in isolated areas of the impacted suppression zone, some servers and storage resources did not shutdown in a controlled manner. As a result, additional time was required to troubleshoot and recover the impacted resources. Once the scale unit reached the required number of operational nodes, customers would have seen gradual, but consistent improvement until fully mitigated at 20:15 UTC when storage and dependent services were able to fully recover.
Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): Suppression system maintenance analysis continues with facility engineers to identify the cause of the unexpected agent release, and to mitigate risk of recurrence. Engineering continues to investigate the failure conditions and recovery time improvements for storage resources in this scenario. As important investigation and analysis are ongoing, an additional update to this RCA will be provided before Friday, 10/13.