Microsoft reports that the outage in Windows Azure's European region was caused by a network misconfiguration and bugs

August 7, 2012

On July 26, the West Europe sub-region of Windows Azure went down for about two and a half hours. Last Thursday, August 2, Microsoft published its final report on the cause of the outage on the Windows Azure blog, in a post titled "Root Cause Analysis for recent Windows Azure Service Interruption in Western Europe".

First, let's review what happened during the July 26 outage.

The first report of the outage was posted on the Windows Azure dashboard at 11:09 AM on July 26.

Jul 26 2012 11:09AM We are experiencing an availability issue in the West Europe sub-region, which impacts access to hosted services in this region. （以下略）

In short: at 11:09 AM on July 26, an availability issue occurred in the West Europe sub-region, affecting access to hosted services in that region.

The following report, timestamped 1:33 PM, announced that the issue had been resolved.

Jul 26 2012 1:33PM The issue has been addressed. Full service functionality has been restored in the region. （以下略）

In short: at 1:33 PM on July 26, the issue was addressed and full service functionality was restored in the region.

No data appears to have been lost, but applications running in this region were naturally affected, with some going down.

So what caused the outage? Let's look at Microsoft's report.

Root Cause Analysis for recent Windows Azure Service Interruption in Western Europe - Windows Azure - Site Home - MSDN Blogs

A misconfigured limit, then bugs were triggered

The report is written by Mike Neil, General Manager of Windows Azure. Let's go through it in order.

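First, the report says that the network hardware used in Windows Azure datacenters has a safety-valve mechanism that limits the scope of connections the devices will accept, in order to head off cascading network failures.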

Windows Azure’s network infrastructure uses a safety valve mechanism to protect against potential cascading networking failures by limiting the scope of connections that can be accepted by our datacenter network hardware devices.

Before this outage, capacity had been added to the West Europe sub-region, but the limit on the corresponding devices was not adjusted to match the added capacity.

Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity.
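The kind of check that was missed can be sketched roughly as follows. This is only an illustration of the idea of validating a limit against capacity, not Microsoft's actual tooling: the `DeviceConfig` type, the `find_stale_limits` function, the device names, and all numbers are hypothetical and do not come from the report.

```python
from dataclasses import dataclass


@dataclass
class DeviceConfig:
    """Hypothetical view of one network device's safety-valve setting."""
    name: str
    connection_limit: int  # maximum connections the device will accept


def find_stale_limits(devices, cluster_capacity, connections_per_unit):
    """Return the names of devices whose limit is below what the capacity needs.

    cluster_capacity and connections_per_unit are illustrative inputs,
    not values taken from Microsoft's report.
    """
    required = cluster_capacity * connections_per_unit
    return [d.name for d in devices if d.connection_limit < required]


devices = [
    DeviceConfig("device-a", connection_limit=10_000),
    DeviceConfig("device-b", connection_limit=15_000),
]

# Before the expansion: 100 capacity units, 100 connections per unit.
print(find_stale_limits(devices, cluster_capacity=100, connections_per_unit=100))
# After expanding to 150 units, device-a's limit no longer matches the capacity.
print(find_stale_limits(devices, cluster_capacity=150, connections_per_unit=100))
```

Under these made-up numbers, the first call reports no problems and the second flags `device-a`, which is the situation the report describes: capacity grew, but a limit stayed where it was.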

Usage of this cluster then rose rapidly, the threshold was exceeded, and that generated a sizeable amount of network management messages.

Because of a rapid increase in usage in this cluster, the threshold was exceeded, resulting in a sizeable amount of network management messages.

The increased management traffic in turn triggered bugs in some of the cluster's hardware devices, driving their CPU utilization to 100% and affecting data traffic.

The increased management traffic in turn, triggered bugs in some of the cluster’s hardware devices, causing these to reach 100% CPU utilization impacting data traffic.
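The chain of events described so far (usage exceeds the limit, management messages pile up, CPU saturates) can be pictured with a toy model. This is a deliberately simplified sketch and not a model of the real devices: the report does not say how many messages were generated or how the bugs consumed CPU, so the `simulate` function, the linear CPU cost per message, and every number below are assumptions made purely for illustration.

```python
def simulate(usage_per_tick, limit, cpu_per_message=0.5, baseline_cpu=20.0):
    """Toy model: each connection over the limit produces one management
    message, and each message adds a fixed (assumed) amount of CPU load.

    Returns a list of (usage, excess_messages, cpu_percent) tuples.
    """
    results = []
    for usage in usage_per_tick:
        excess = max(0, usage - limit)  # messages generated this tick
        cpu = min(100.0, baseline_cpu + excess * cpu_per_message)
        results.append((usage, excess, cpu))
    return results


# Hypothetical usage ramp against a limit that was never raised.
timeline = simulate([80, 100, 150, 300], limit=100)
for usage, excess, cpu in timeline:
    print(f"usage={usage} excess={excess} cpu={cpu:.0f}%")
```

In this sketch nothing happens until usage crosses the limit; after that the load climbs quickly and is capped at 100%, at which point a real device would have no CPU left for forwarding data traffic.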

That is the part of the report explaining why the outage occurred. It does describe the events that took place, but the explanation only skims the surface.