On 4 August 2017, we experienced a major service incident that lasted just over 2 hours and affected customers accessing VSTS accounts in either a US or Brazil datacenter (incident blog here). This was a repeat, albeit with greater severity, of an issue that occurred on 2 August 2017 and impacted customers in Asia (incident blog here). We know how important VSTS is to our users and deeply regret the interruption in service. Over the last several days, we have worked to understand the incident's full timeline, what caused it, how we responded to it, and what improvements we need to make to avoid similar outages in the future.
Customer Impact:
We had two distinct impact periods which share a common root cause.
Starting at 7:39 UTC on 2 August 2017, some VSTS users in Asia experienced issues when trying to access their VSTS account. The issue was resolved for all users by 4:47 UTC on 3 August 2017. We are uncertain as to the number of users impacted as these errors occurred before any traffic hit VSTS. The users impacted received an error like the message below:
The second impact window started at 10:28 UTC on 4 August 2017. During this incident window, some VSTS users accessing a VSTS account in either a US or Brazil datacenter received server errors similar to the message below:
In addition, VSTS users in North or South America may have also experienced the same DNS issue as the first incident window (example above). This incident was resolved for all users by 12:49 UTC. The chart below shows the percentage of active VSTS users who were impacted by HTTP 500 errors during the second impact window:
What went wrong:
VSTS relies on a DNS service operated by Microsoft that hosts Microsoft-owned DNS zones for name resolution of both our public and internal endpoints. On the day of the incident, the DNS team was performing routine maintenance operations on its fleet of DNS servers. When a server is taken out of rotation for this purpose, the records in the zones on that server are expected to be cleared out, and a fresh copy with the latest information is loaded when the server comes back into rotation. While one of the servers in the Asia region was being taken out of rotation, a race condition occurred between the process that clears the zone data and the process that normally updates zone data while the server is running. As a result, this server retained a partially complete copy of the visualstudio.com zone. At 7:39 UTC on 2 August 2017, this DNS server was put back into production, and the process that loads the latest copy of the zones on return to rotation failed to correctly update the visualstudio.com zone because of the partially complete zone data already present on that server. The unhealthy DNS server was therefore missing DNS records for some of the subdomains of visualstudio.com. As the Time to Live (TTL) expired for the DNS records of different VSTS accounts on users' computers, some users requested DNS resolution from the unhealthy server, which returned an NXDOMAIN response indicating that it had no knowledge of the domain. That negative response was then cached for 15 minutes, preventing those users from accessing their accounts during that time. The unhealthy DNS server was only one of multiple DNS servers in the region, so not every VSTS user was impacted.
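One way to see this failure mode from the outside is to query each resolver in a pool directly for a name that should always exist. The sketch below is not the incident tooling; it is a minimal illustration using the dnspython library, with placeholder server addresses and a placeholder hostname, of how a single unhealthy server can answer NXDOMAIN while its peers answer correctly.

```python
# Minimal sketch (not the actual VSTS/DNS-team tooling): query each DNS server
# in a pool directly for a name that should always resolve, and flag any server
# that answers NXDOMAIN. Server IPs and the probe hostname are placeholders.
import dns.exception
import dns.resolver

DNS_SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]          # placeholder resolver pool
PROBE_NAME = "status.example-placeholder.visualstudio.com"  # placeholder record

def check_server(server_ip: str) -> str:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server_ip]
    resolver.lifetime = 3  # seconds before we give up on this server
    try:
        answer = resolver.resolve(PROBE_NAME, "A")
        return f"OK ({answer[0].address})"
    except dns.resolver.NXDOMAIN:
        # The failure mode in this incident: the server claims the name does
        # not exist, and clients then cache that negative answer.
        return "UNHEALTHY: NXDOMAIN for a name that should exist"
    except (dns.resolver.NoAnswer, dns.exception.Timeout) as exc:
        return f"UNHEALTHY: {type(exc).__name__}"

if __name__ == "__main__":
    for ip in DNS_SERVERS:
        print(f"{ip}: {check_server(ip)}")
```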
We received both internal and external customer escalations of the issue. However, by the time it had reached the correct team for troubleshooting, most users in the Asia region were done for the day. We escalated to the DNS team suspecting a DNS issue, but without a live repro of the problem we were unable to understand exactly what the issue was. As we had not received any reports of the issue during business hours in Redmond and had reports of the issue being mitigated in Asia, we decided to close the incident. What we did not realize at the time was that only one of many DNS servers was unhealthy and only users in Asia were susceptible to impact.
As VSTS users in Asia came back online the next day, we began to receive reports of the issue again. We immediately spun up a conference bridge and engaged the DNS team. This time we were able to work with internal users and our Customer Support organization to get a live repro of the issue. Diagnostics from an impacted user's computer allowed us to quickly identify the unhealthy DNS server, and the DNS team removed it from production at 4:32 UTC on 3 August 2017. All users were unblocked by 4:47 UTC, as the NXDOMAIN response has a TTL of 15 minutes.
A recurrence of the issue happened on two North America DNS servers starting at 10:28 UTC on 4 August 2017. As before, the DNS service inadvertently introduced two unhealthy servers into production after performing patching. The impact was much greater during this window because our own services were using these unhealthy servers for DNS resolution of one of our internal services, SPS.
The graph below shows the behavior of a single web server in one of our scale units. In this case, the DNS entry for SPS (which has a TTL of 5 minutes) was refreshed at 10:32 UTC with a 15-minute-TTL NXDOMAIN response from one of the unhealthy DNS servers, causing the web server to return errors to users for 15 minutes. When it refreshed at 10:47 UTC, it again got a bad response from a DNS server, causing another 15 minutes of errors. At ~11:03 the server received a good response and continued to receive one every 5 minutes until 11:23. This pattern continued until the incident was mitigated.
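The alternating pattern described above falls directly out of the two TTLs involved: a healthy answer is re-checked after 5 minutes, while a cached NXDOMAIN pins the web server in an error state for 15 minutes. The short sketch below is an illustrative simulation of that interaction, not taken from our telemetry; the probability of a refresh hitting an unhealthy resolver is an assumed parameter.

```python
# Illustrative simulation (not actual telemetry) of how a 5-minute positive TTL
# and a 15-minute negative (NXDOMAIN) TTL interact on a single web server.
# "P_BAD" is an assumed probability that a given refresh hits an unhealthy resolver.
import random

POSITIVE_TTL = 5    # minutes: a healthy answer is cached this long
NEGATIVE_TTL = 15   # minutes: an NXDOMAIN answer is cached this long
P_BAD = 0.3         # assumed chance a refresh lands on an unhealthy DNS server

def simulate(duration_min: int = 120, seed: int = 42) -> None:
    random.seed(seed)
    t = 0
    while t < duration_min:
        if random.random() < P_BAD:
            # Negative answer cached: the web server cannot resolve SPS and
            # returns errors to users until the negative cache entry expires.
            print(f"t={t:3d} min: NXDOMAIN cached -> errors for {NEGATIVE_TTL} min")
            t += NEGATIVE_TTL
        else:
            # Healthy answer cached: normal operation for the positive TTL.
            print(f"t={t:3d} min: healthy answer  -> OK for {POSITIVE_TTL} min")
            t += POSITIVE_TTL

if __name__ == "__main__":
    simulate()
```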
Both incident windows share a common root cause: a race condition that occurs when DNS servers are taken out of rotation for patching. The race is triggered by a particular shape of our DNS zone data, and it leaves inconsistent zone data behind when an affected server is put back into rotation.
During the first incident window, it was challenging to determine the exact customer impact and end-user experience because the error occurred entirely on the client side and never reached our telemetry. After consulting with the DNS team, we were ready to engage if the issue reappeared in Asia, which it did. We re-engaged with the DNS team at 3:33 UTC on 3 August and addressed the issue by 4:32 UTC. Most of the time between re-engagement and mitigation was spent validating that all remaining servers were healthy, gaining access to production resources, and removing the unhealthy server.
In the second incident window, we detected the incident at 10:41 UTC on 4 August; however, we escalated through a more generic support team, and it took 1 hour and 27 minutes for us to get connected to the DNS team. The unhealthy servers were removed from production at 12:13 UTC, 18 minutes after the DNS team was engaged.
Next Steps:
In retrospect, there are several things we could have done to mitigate the incident faster. Below we list the key repair items we will implement to improve our telemetry, alerting, and overall live site process. Once in place, they will help us understand the scope and source of issues like this much faster while decreasing time to mitigate.
The matrix below categorizes the issues and gaps being addressed with details on the solutions we are committed to delivering:
| Category | Issue being addressed | Improvement |
|---|---|---|
| Process | Some alerts are sent to a Line 1 support team, which then engages the SRE team. This process added 10 minutes to incident time. | Send known valid alerts directly to the SRE team. |
| Process | Engaging the DNS team took over 40 minutes. | Streamline the process for engaging partner teams when the issue is well understood. |
| Detection | It took us 13 minutes to detect that our services were unable to resolve the SPS endpoint. | Create an alert that specifically looks for the "The remote name could not be resolved" exception, which should never appear as part of normal service operations (see the sketch after this table). |
| Resiliency | Negative DNS responses were cached for 15 minutes, lengthening impact. | Adjust the negative-response TTL from 15 minutes to 1 minute. |
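As an illustration of the Detection repair item above, the sketch below shows one way such an alert could work: scan a window of recent request logs for the "The remote name could not be resolved" message and alert if it appears at all. The log path, line format, and threshold are assumptions for illustration; our production alerting runs on internal telemetry pipelines rather than flat files.

```python
# Illustrative sketch of a "remote name could not be resolved" alert.
# The log path, line format, and threshold are assumptions, not production code.
from datetime import datetime, timedelta

SIGNATURE = "The remote name could not be resolved"
WINDOW = timedelta(minutes=5)
THRESHOLD = 1  # this exception should never occur during normal operation

def count_recent_matches(log_path: str, now: datetime) -> int:
    """Count log lines within WINDOW of `now` that contain the DNS failure signature.
    Assumes each line starts with a naive ISO-8601 UTC timestamp, e.g. 2017-08-04T10:41:00."""
    matches = 0
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            try:
                stamp = datetime.fromisoformat(line.split(" ", 1)[0])
            except ValueError:
                continue  # skip lines that do not start with a timestamp
            if now - stamp <= WINDOW and SIGNATURE in line:
                matches += 1
    return matches

if __name__ == "__main__":
    hits = count_recent_matches("service.log", datetime.utcnow())  # placeholder log file
    if hits >= THRESHOLD:
        print(f"ALERT: {hits} DNS resolution failures in the last {WINDOW}")
```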
In addition, the DNS team is working to prevent similar incidents in the future by executing the following repair items.
| Category | Issue being addressed | Improvement |
|---|---|---|
| Software | Race condition resulting in partial zone data when servers are taken out of rotation. | Software fix to ensure zones are completely flushed before de-provisioning is complete. |
| Detection | Not detecting corrupt zone data when a server is brought back into rotation. | Software fix to ensure servers with corrupt zone data do not serve DNS requests. |
| Detection | We were not aware of the corrupt zone data in our server. | Implement automated data integrity checkers. |
| Detection | We were not able to locate the corrupt server and remove it from production rotation fast enough. | Enhance current tooling and build new capabilities as needed to identify any servers with corrupt zone data and remove them from production rotation faster. |
Again, we want to offer our apologies for the impact this incident had on our users. We take the reliability and performance of our service very seriously. Please understand that we are fully committed to learning from this event and delivering the improvements needed to avoid future issues.
Sincerely,
Tom Moore, VSTS SRE Group Manager