Why the Storm That Hit the Amazon Cloud Grew into EC2, EBS, ELB, and RDS Failures (Part 1)

July 10, 2012

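The Amazon cloud trouble that began in the United States on the night of June 29 affected well-known services such as Instagram, Flipboard, and Netflix. Several services in Japan also appear to have been affected.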

Summary of the AWS Service Event in the US East Region

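The trouble started with a power failure. The storm temporarily disrupted utility power, and although the load switched over to UPS, the UPS ran out of power and part of the datacenter infrastructure stopped operating.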

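The power failure directly affected only a single-digit percentage of the Region, but it triggered a series of software and operational problems in the virtual machine service "EC2", the storage service "EBS", the load balancing service "ELB", the database service "RDS", and others. In some cases these grew into failures that affected multiple Availability Zones.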

This incident affected multiple Availability Zones, and it came right after another power-related problem in mid-June. It is fair to call it a painful blow for the company.

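On the other hand, the report published on July 2 as "Summary of the AWS Service Event in the US East Region" is long and highly detailed. It conveys something like the company's determination to investigate and understand everything thoroughly, from the causes of the failure through its course and outcome, and to act on what it finds.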

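In a typical system integration project, analyzing a large-scale failure often takes several weeks or more. When the system involves multiple vendors, engineers from each of them have to come together to isolate the problem, which is difficult in itself, and vendors tend to shift responsibility onto one another.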

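Amazon, by contrast, builds and operates everything itself, from the datacenters through the infrastructure to the software. The report reflects the company's confidence and underlying strength: because it owns the whole stack, it can understand everything that happened during an incident and produce an investigation report and countermeasures right away.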

The report is very long, so let's follow its key points.

Power Failure Drains the UPS, Causing a 10-Minute Power Loss

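The US East-1 Region consists of more than 10 datacenters organized into multiple Availability Zones. The Availability Zones are in distinct physical locations and are engineered so that a failure in one is isolated from the others.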

Our US East-1 Region consists of more than 10 datacenters structured into multiple Availability Zones. These Availability Zones are in distinct physical locations and are engineered to isolate failure from each other.

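At 7:24pm PDT, the electrical switching equipment in two datacenters supporting a single Availability Zone experienced a large voltage spike. (Niino's note: utility power probably became unstable because of lightning or a similar effect.)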

At 7:24pm PDT, a large voltage spike was experienced by the electrical switching equipment in two of the US East-1 datacenters supporting a single Availability Zone.

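Power was switched to the generators, but utility power returned shortly afterward, so staff transferred the datacenters back to utility power. Utility power then failed a second time at 7:57pm.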

Shortly thereafter, utility power was restored and our datacenter personnel transferred the datacenter back to utility power. The utility power in the Region failed a second time at 7:57pm PDT.

This time, one of the datacenters failed to transfer to its backup generators. Its servers nevertheless kept operating normally on UPS power.

In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptable Power Supply (“UPS”) power.

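While onsite staff worked to stabilize the primary and backup generators, the UPS ran out of power at 8:04pm. Ten minutes later generator power was stabilized, and power began to return at 8:14pm. By 8:24pm, every rack in the datacenter had power again.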

As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT. Ten minutes later, the backup generator power was stabilized, the UPSs were restarted, and power started to be restored by 8:14pm PDT. At 8:24pm PDT, the full facility had power to all racks.
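The intervals in this timeline are easy to check with a few lines of code. The following is only a minimal sketch: the timestamps are taken from the quotes above (converted to 24-hour PDT), while the `EVENTS` list, the event labels, and the helper names are illustrative assumptions, not anything defined in the AWS report.

```python
from datetime import datetime

# Timestamps (PDT, June 29, 2012) taken from the report quotes above.
# The labels are informal summaries written for this sketch.
EVENTS = [
    ("19:24", "voltage spike at the switching equipment"),
    ("19:57", "utility power fails a second time"),
    ("20:04", "UPS depleted, servers begin losing power"),
    ("20:14", "generator power stabilized, power starts to return"),
    ("20:24", "full facility has power to all racks"),
]


def minutes_between(start: str, end: str) -> int:
    """Return the number of whole minutes from start to end (same day, HH:MM)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def intervals(events):
    """Return (start, end, minutes) for each pair of consecutive events."""
    return [
        (a[0], b[0], minutes_between(a[0], b[0]))
        for a, b in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for start, end, mins in intervals(EVENTS):
        print(f"{start} -> {end}: {mins} min")
```

Under these figures the UPS carried the load for about 7 minutes after the second utility failure, and servers were without power for roughly 10 minutes before power began to return.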

The Power Loss Spreads into Failures of Individual Services

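Up to this point, what happened was a 10-minute power loss in part of the datacenter infrastructure. That loss, however, triggered a wider impact. The report describes the failure as a whole as follows.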

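The EC2, EBS, RDS, and ELB resources in the affected datacenter made up only a single-digit percentage of the US East-1 Region as a whole. Even so, many customers were seriously affected.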

Though the resources in this datacenter, including Elastic Compute Cloud (EC2) instances, Elastic Block Store (EBS) storage volumes, Relational Database Service (RDS) instances, and Elastic Load Balancer (ELB) instances, represent a single-digit percentage of the total resources in the US East-1 Region, there was significant impact to many customers.

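The impact took two forms. The first was that instances and volumes in the datacenter that lost power were unavailable. This impact was limited to the affected Availability Zone, and the other Availability Zones kept running normally.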

The impact manifested in two forms. The first was the unavailability of instances and volumes running in the affected datacenter. This kind of impact was limited to the affected Availability Zone. Other Availability Zones in the US East-1 Region continued functioning normally.

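The second form of impact was degradation of the "control plane" services. The control planes are what customers use to act on resources across the whole Region: creating, removing, or changing them.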

The second form of impact was degradation of service “control planes” which allow customers to take action and create, remove, or change resources across the Region.

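This control plane problem, the second form of impact, is what spread into failures of the virtual machine service "EC2", the storage service "EBS", the load balancing service "ELB", the database service "RDS", and others.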

From here, let's look at what the report says happened in each service in connection with the control plane problem.

Impact on Amazon EC2 and EBS, and the Countermeasures

About 7% of the EC2 instances in the US East Region were affected by the power loss.

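Most of these instances came back online between 11:15pm and shortly after midnight. Recovery took this long because of a bottleneck in the server boot process, and AWS plans to remove that bottleneck.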

The vast majority of these instances came back online between 11:15pm PDT and just after midnight. Time for the completion of this recovery was extended by a bottleneck in our server booting process. Removing this bottleneck is one of the actions we’ll take to improve recovery times in the face of power failure.

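EBS was affected in about the same proportion as EC2. Most EBS servers were back up by 12:25am on Saturday, but volumes that had writes in flight when power was lost could be in an inconsistent state.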

EBS had a comparable percentage (relative to EC2) of its volumes in the Region impacted by this event. The majority of EBS servers had been brought up by 12:25am PDT on Saturday. However, for EBS data volumes that had in-flight writes at the time of the power loss, those volumes had the potential to be in an inconsistent state.

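The time needed to recover EBS volumes has been cut considerably over the past six months, but so many volumes needed processing that clearing the backlog still took several hours. By 2:45am, 90% of the outstanding volumes had been returned to customers. AWS plans to optimize this process further to shorten recovery.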

Though the time to recover these EBS volumes has been reduced dramatically over the last 6 months, the number of volumes requiring processing was large enough that it still took several hours to complete the backlog. By 2:45am PDT, 90% of outstanding volumes had been turned over to customers. We have identified several areas in the recovery process that we will further optimize to improve the speed of processing recovered volumes.

A Large Backlog in the EC2 and EBS Control Planes

The EC2 and EBS control planes were seriously affected by the power loss, and calls to create new resources or change existing ones failed.

The control planes for EC2 and EBS were significantly impacted by the power failure, and calls to create new resources or change existing resources failed.

From 8:04pm to 9:10pm, customers could not launch EC2 instances, create EBS volumes, or attach volumes in any Availability Zone in the Region.

From 8:04pm PDT to 9:10pm PDT, customers were not able to launch new EC2 instances, create EBS volumes, or attach volumes in any Availability Zone in the US-East-1 Region.

The control planes took this long to recover because failover to a new primary datastore could not be carried out quickly.

The duration of the recovery time for the EC2 and EBS control planes was the result of our inability to rapidly fail over to a new primary datastore.

The EC2 and EBS APIs are implemented on datastores that are replicated across multiple Availability Zones. These datastores hold metadata for instances, volumes, and snapshots.

The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. These datastores are used to store metadata for resources such as instances, volumes, and snapshots.

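To protect the datastore from corruption, when the primary copy is lost to a power failure or similar event, the system currently switches the datastores in the other Availability Zones to read-only mode automatically. This continues until power returns to the affected Availability Zone or until a copy in another Availability Zone is judged safe to promote to the new primary. (Niino's note: this appears to be where the time went.)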

To protect against datastore corruption, currently when the primary copy loses power, the system automatically flips to a read-only mode in the other Availability Zones until power is restored to the affected Availability Zone or until we determine it is safe to promote another copy to primary.
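The behavior described here can be pictured as a small state machine. The following is only a minimal sketch based on the one-sentence description in the report: the class name `ControlPlaneDatastore`, its methods, and the Availability Zone labels are hypothetical, and the real AWS implementation is not public. In particular, treating the old primary as a replica after promotion is an assumption made for this sketch.

```python
from enum import Enum


class Mode(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"


class ControlPlaneDatastore:
    """Hypothetical model of a datastore replicated across Availability Zones."""

    def __init__(self, primary_az, replica_azs):
        self.primary_az = primary_az
        self.replica_azs = list(replica_azs)
        self.mode = Mode.READ_WRITE

    def primary_lost_power(self):
        # The surviving copies flip to read-only to avoid corruption.
        self.mode = Mode.READ_ONLY

    def primary_power_restored(self):
        # Exit 1: power returns to the affected Availability Zone.
        self.mode = Mode.READ_WRITE

    def promote(self, az, judged_safe):
        """Exit 2: promote another copy, but only once it is judged safe."""
        if az not in self.replica_azs:
            raise ValueError(f"{az} does not hold a replica")
        if not judged_safe:
            return False  # stay read-only while the assessment is pending
        self.replica_azs.remove(az)
        # Assumption: the old primary rejoins as a replica later.
        self.replica_azs.append(self.primary_az)
        self.primary_az = az
        self.mode = Mode.READ_WRITE
        return True

    def can_mutate(self):
        # Create, change, and delete calls need a writable primary.
        return self.mode is Mode.READ_WRITE

    def can_describe(self):
        # Read-only calls keep working in either mode.
        return True


if __name__ == "__main__":
    ds = ControlPlaneDatastore("az-a", ["az-b", "az-c"])
    ds.primary_lost_power()
    print(ds.can_mutate(), ds.can_describe())
    ds.promote("az-b", judged_safe=True)
    print(ds.primary_az, ds.can_mutate())
```

In this model, mutating calls fail for as long as the datastore sits in read-only mode, which matches the window in which customers could not launch instances or create volumes. How long that window lasts depends entirely on how quickly one of the two exits is taken.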

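This assessment and failover had to be done manually. AWS is addressing the causes of that blockage, and work is already underway to make the switch happen automatically.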

We are addressing the sources of blockage which forced manual assessment and required hand-managed failover for the control plane, and have work already underway to have this flip happen automatically.

This article has grown long, so it continues in Part 2, which covers the impact on Amazon ELB, the most serious of the failures, among other topics.

≫ Continue to Part 2