Amazon Cloud Temporarily Malfunctions Due to Network Device Failure; Heroku, Parse, Kinvey, and Others Go Down

August 28, 2013

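At around 1:00 PM on August 25 (US Pacific time; around 5:00 AM on August 26 Japan time), Amazon EBS, the storage service offered in the Amazon cloud's US East (Northern Virginia) region, began malfunctioning in a particular Availability Zone. Performance degraded, and as a result services such as Heroku, Parse, and Kinvey were reported to have gone down temporarily (Heroku's incident report).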

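The cause appears to have been packet loss from a faulty network device, and the service has since returned to normal. However, Amazon's cloud network is surely built with redundancy, and it is hard to believe that a simple device failure alone would bring about an outage, so the actual cause is probably more complex than what has been reported here.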

Let's look back at what happened, based on the Amazon cloud status reports.

Outage in an Availability Zone of the US East Region

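At 1:22 PM on August 25 (US Pacific time; the same applies below), degraded performance of storage volumes was detected in a particular Availability Zone of the US East region.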

We are investigating degraded performance for some volumes in a single AZ in the US-EAST-1 Region

Seven minutes later, the situation worsened to include errors in Amazon EBS-related APIs and in instance launches.

We are investigating degraded performance for some EBS volumes and elevated EBS-related API and EBS-backed instance launch errors in a single AZ in the US-EAST-1 Region.

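At 2:21 PM, the cause of the failure had been identified and fixed; instance launches were working normally again, and most of the affected volumes had returned to normal operation.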

We have identified and fixed the root cause of the performance issue. EBS backed instance launches are now operating normally. Most previously impacted volumes are now operating normally and we will continue to work on instances and volumes that are still experiencing degraded performance

About an hour later, a brief summary of the outage was posted.

From approximately 12:51 PM PDT to 1:42 PM PDT network packet loss caused elevated EBS-related API error rates in a single AZ, a small number of EBS volumes in that AZ to experience degraded performance, and a small number of EC2 instances to become unreachable due to packet loss in a single AZ in the US-EAST-1 Region. The root cause was a "grey" partial failure with a networking device that caused a portion of the AZ to experience packet loss. The network issue was resolved and most volumes, instances, and API calls returned to normal. The networking device was removed from service and we are performing a forensic investigation to understand how it failed. We are continuing to work on a small number of instances and volumes that require additional maintenance before they return to normal performance.

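From approximately 12:51 PM to 1:42 PM PDT, network packet loss caused elevated error rates for EBS-related APIs in a single Availability Zone, and a small number of EBS volumes in that Availability Zone experienced degraded performance. In addition, a small number of EC2 instances in that Availability Zone of the US East region became unreachable due to the packet loss.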

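The root cause was a "grey" partial failure of a networking device, which caused a portion of the Availability Zone to experience packet loss. The network issue has been resolved, and most volumes, instances, and API calls have returned to normal.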

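The networking device responsible has been removed from service, and a forensic investigation into how it failed is under way. A small number of instances and volumes still require additional maintenance before they return to normal performance.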
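
As a quick check on the timeline, the packet-loss window given in the summary can be converted to Japan time and its length computed. The following is only a minimal sketch: the fixed offsets (PDT as UTC-7, JST as UTC+9) and the names `loss_start`, `loss_end`, `to_jst`, and `window_minutes` are assumptions introduced here for illustration, not anything taken from the AWS report.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed UTC offsets: PDT is UTC-7 in late August, JST is UTC+9.
PDT = timezone(timedelta(hours=-7), "PDT")
JST = timezone(timedelta(hours=9), "JST")

# Packet-loss window from the summary (August 25, 2013, Pacific time).
loss_start = datetime(2013, 8, 25, 12, 51, tzinfo=PDT)
loss_end = datetime(2013, 8, 25, 13, 42, tzinfo=PDT)


def to_jst(moment):
    """Convert a timezone-aware datetime to Japan Standard Time."""
    return moment.astimezone(JST)


# Length of the packet-loss window in whole minutes.
window_minutes = int((loss_end - loss_start).total_seconds() // 60)

if __name__ == "__main__":
    print("Window length (minutes):", window_minutes)
    print("Start in JST:", to_jst(loss_start).strftime("%Y-%m-%d %H:%M"))
    print("End in JST:", to_jst(loss_end).strftime("%Y-%m-%d %H:%M"))
```

Under these assumptions the window is just under an hour long and falls in the early morning of August 26 in Japan.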

Shortly before 6:00 PM, full recovery was reported.