Rackspace has sent us the full incident reports from their June 12 and June 20 downtimes. We have reposted them in their entirety below.
June 12th ORD Cloud Server Instability
At approximately 10:30 a.m. CDT, our cloud engineers were alerted to an issue impacting services for several thousand customers within our ORD1 data center.
This issue occurred when our Software Defined Network (SDN) cluster suffered cascading node failures, which caused some customers to experience intermittent network connectivity, and in some cases extended service interruption, until approximately 4:30 p.m. CDT.
The controller node failures were caused by corrupted port data from Open vSwitch. The corrupted port data triggered a previously unidentified bug that caused nodes within the control cluster to crash repeatedly until the corrupted port data was identified and fixed. The cluster was repaired and customers began to come back online, with all residual effects eliminated by 4:30 p.m. CDT. The system is now stable, and we are working with our SDN vendor on a permanent fix.
Why did we experience issues within the APIs for both DFW and ORD?
While we were experiencing service degradation in the ORD region for Next Gen Cloud Servers, Rackspace also saw availability dips in both our ORD and DFW Next Gen APIs.
During this time, we experienced increased traffic in our Control Panel as customers began logging in to check their instances in ORD after the network degradation began. This caused additional load on the systems responsible for image management in both regions. Under the increased traffic, the databases behind these systems became overloaded, which translated into dips in API availability.
Recent performance monitoring for those systems identified queries that could be optimized and were already scheduled for an upcoming code release. In order to fully resolve the issues in both regions, the query portions of the scheduled code release were hot patched into the environments, which restored API stability for both regions.
We apologize for any inconvenience this may have caused you or your customers. If you have any further questions, please feel free to contact a member of your support team.
June 20th ORD Service Interruption During Scheduled Maintenance
While performing a scheduled upgrade on the Software Defined Networking (SDN) control cluster for Next Gen Cloud Servers in our ORD datacenter, we experienced two issues that created downtime for our customers and forced us to unexpectedly extend the maintenance window.
The first issue occurred when a configuration sync flag did not fully apply to all hypervisors via the upgrade manager software deploying the cluster updates. This caused issues for customers ranging from intermittent packet loss to a few minutes of network disruption. The root cause of this problem was in the manual configuration of the automated deployment tool, not the underlying cloud network. Rackspace and vendor engineers identified the issue immediately and fixed it by 3:45 AM CDT, within the original maintenance window.
During the maintenance wrap-up process, Rackspace engineers discovered a component of the network configuration that was inadvertently overwritten by the upgrade. That component of the network configuration was deployed fairly recently, on May 24th, 2013, and was necessary to ensure that customer server connectivity was maintained and new server provisioning succeeded. Rackspace made the choice to extend the maintenance window by one hour, fix the configuration and reboot the clusters. The clusters finished syncing by 5:30 AM and then the hypervisors were able to check back in for updated flows. Any residual customer impact was confirmed resolved between 5:45 AM and 6:00 AM. Had Rackspace closed the maintenance window without making this fix, our customers would have been exposed to potential intermittent network instability and provisioning errors until the next maintenance window was scheduled.
Rackspace prides itself on the transparency of our communications. In this event, we did not live up to our standards. We believe the decision to extend the window was the right decision for our customers, but we did not clearly communicate the rationale for the decision in the manner our customers expect.
Stability and uptime are paramount to our customers and to Rackspace. We apologize for the issues and the manner in which communications were handled. We are reviewing all elements of our maintenance and incident management processes to ensure that these issues do not occur again. If you have any questions, please contact a member of your support team.