Post Mortem: Networking failures for Visual Studio Team Services on 12 July 2017

On 12 July 2017, we experienced a 19-minute networking incident that affected VSTS customers (you can find the incident blog details here). We sincerely regret the inconvenience this caused our users. Together with our partners in Azure networking, we have conducted an internal post mortem to review the incident and have identified improvements we need to make to avoid similar outages in the future.

Customer Impact:

Starting at 06:34 UTC on 12 July 2017, for a period of 19 minutes, VSTS users experienced performance issues and internal server errors while trying to sign in and interact with the service.

The chart below shows the percentage of active VSTS users who experienced errors while trying to sign in or access the service during the incident window.

What went wrong:

A firmware update was being rolled out to route reflectors across the WAN fleet as part of ongoing upgrade and maintenance operations by the Azure networking team. During this upgrade, a human error caused a set of upgrade operations to be performed on multiple route reflectors simultaneously, which removed critical redundancy. The route reflectors recovered as soon as the upgrade completed, after which networking services and access to VSTS were fully restored. More details about the Azure Network incident can be found here.

The incident was detected through our outside-in monitoring tests, as shown in the chart below, and through a series of Circuit Breaker Exceptions, which fire when a key dependency of VSTS fails (in this case, networking).
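For readers unfamiliar with the circuit breaker pattern mentioned above, the following is a minimal, generic sketch in Python. It is not the VSTS implementation; the class names, the failure threshold, and the cool-down period are all assumptions chosen for illustration.

```python
import time


class CircuitBreakerOpen(Exception):
    """Raised when a call is skipped because the dependency looks unhealthy."""


class CircuitBreaker:
    # Hypothetical defaults: open after 3 consecutive failures, retry after 30 s.
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                # Still cooling down: fail fast instead of calling the dependency.
                raise CircuitBreakerOpen("dependency unavailable; call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            raise
        # Success: close the breaker and reset the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

In a sketch like this, each `CircuitBreakerOpen` raised is also a useful monitoring signal, since a burst of them across services points at a shared dependency such as networking.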

Next Steps:

As part of our continuous efforts to improve the overall service, and in partnership with the networking team, we have identified the following areas for improvement:

Introduce additional software checks to prevent multiple router updates from occurring simultaneously
Review the route maintenance sequence to ensure that critical router dependencies in a region are not updated during the same day
While the incident was detected through our automated outside-in alerts, we have identified additional alerting that can be introduced in VSTS to isolate networking-specific alerts for some of our core services.
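To illustrate the kind of software check described in the first item above, here is a minimal sketch of a guard that refuses to start an update while another device in the same redundancy group is still being updated. It is not the Azure networking team's tooling; `UpdateGuard`, the group and device names, and the one-at-a-time limit are hypothetical.

```python
import threading


class UpdateGuard:
    """Allows at most `max_in_flight` devices per redundancy group to update at once."""

    def __init__(self, max_in_flight=1):
        self.max_in_flight = max_in_flight
        self._lock = threading.Lock()
        self._in_flight = {}  # group name -> set of device names being updated

    def try_begin(self, group, device):
        """Return True and record the update if it is safe to start, else False."""
        with self._lock:
            active = self._in_flight.setdefault(group, set())
            if device in active or len(active) >= self.max_in_flight:
                return False
            active.add(device)
            return True

    def finish(self, group, device):
        """Mark the device's update as complete so the next one may start."""
        with self._lock:
            self._in_flight.get(group, set()).discard(device)
```

An upgrade script would call `try_begin` before touching a device and `finish` once the device has recovered, so a second route reflector in the same group is rejected until the first is healthy again.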

We apologize for the impact this has had on our users. We strive to learn from every incident that our customers experience and will ensure that the repair items and improvement opportunities we identified are carried out in a timely manner.

Sincerely,
Harish Thekethil, VSTS SRE Manager
