Why the Storm That Hit the Amazon Cloud Escalated into EC2, EBS, ELB, and RDS Failures (Part 2)

July 10, 2012

The Amazon cloud outage that began in the US on the night of June 29 affected well-known services such as Instagram, Flipboard, and Netflix. Some services in Japan also appear to have been affected.

Let's walk through the key points of "Summary of the AWS Service Event in the US East Region," the detailed report Amazon published on this outage.

(This article continues from "Why the Storm That Hit the Amazon Cloud Escalated into EC2, EBS, ELB, and RDS Failures (Part 1).")

Impact on Amazon ELB and countermeasures

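ELB, which provides load balancing, was hit hardest in this outage. The report describes how the impact on ELB spread across multiple Availability Zones and how the cause was a previously unknown bug. The following is taken from the report.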

ELB can be deployed across multiple Availability Zones (multi-Availability Zone). In this configuration, each Availability Zone's end-point has its own IP address, and a single domain name points to all of those IP addresses.

ELBs can also be deployed in multiple Availability Zones. In this configuration, each Availability Zone’s end-point will have a separate IP address. A single Domain Name will point to all of the end-points’ IP addresses.

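A multi-Availability Zone ELB is maintained redundantly in the Availability Zones the customer requests, so that the end-point stays reachable from outside even if a single machine or datacenter fails.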

For multi-Availability Zone ELBs, the ELB service maintains ELBs redundantly in the Availability Zones a customer requests them to be in so that failure of a single machine or datacenter won’t take down the end-point.

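ELB detects failures and avoids their impact by automatically removing the IP address of the problematic ELB instance from the list returned by DNS. The ELB control plane processes all management events for ELBs, including traffic shifts due to failure, scaling in response to traffic growth, and adding or removing EC2 instances associated with an ELB.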

The ELB service avoids impact (even for clients which can only process a single IP address) by detecting failure and eliminating the problematic ELB instance’s IP address from the list returned by DNS. The ELB control plane processes all management events for ELBs including traffic shifts due to failure, size scaling for ELB due to traffic growth, and addition and removal of EC2 instances from association with a given ELB.
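The DNS behavior described above can be pictured with a minimal sketch. It is not how ELB is implemented; the `Endpoint` class, the `dns_answer` function, the zone labels, and the documentation-range IP addresses are all hypothetical names chosen for illustration. It only shows the idea of dropping a failed end-point's IP address from the list a lookup returns.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    zone: str             # Availability Zone label (hypothetical)
    ip: str               # IP address of the ELB end-point in that zone
    healthy: bool = True  # result of failure detection (assumed to exist elsewhere)


def dns_answer(endpoints):
    """Return the IP addresses a lookup would receive: healthy end-points only."""
    return [e.ip for e in endpoints if e.healthy]


endpoints = [
    Endpoint("zone-a", "192.0.2.10"),
    Endpoint("zone-b", "192.0.2.20"),
]

before = dns_answer(endpoints)   # both zones are listed
endpoints[0].healthy = False     # failure detected in zone-a
after = dns_answer(endpoints)    # only zone-b remains in the answer
print(before, after)
```

Because the failed address simply disappears from the answer, even a client that can use only one IP address is eventually handed a working one.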

When power was lost, the ELB control plane began shifting traffic.

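Then, as power returned, a large number of ELBs came up in a state that triggered a previously unseen bug. The bug caused the ELB control plane to try to scale these ELBs to larger instance sizes.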

As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before. The bug caused the ELB control plane to attempt to scale these ELBs to larger ELB instance sizes.

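The result was a sudden flood of requests that began to back up in the control plane. At the same time, customers were launching new EC2 instances to replace the capacity lost in the affected Availability Zone and requesting that those instances be added to existing load balancers in other zones.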

This resulted in a sudden flood of requests which began to backlog the control plane. At the same time, customers began launching new EC2 instances to replace capacity lost in the impacted Availability Zone, requesting the instances be added to existing load balancers in the other zones.

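These requests added further to the ELB control plane backlog. Because the control plane handled management requests for the US East region through a single shared queue, it fell further and further behind, and requests took longer and longer to complete.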

These requests further increased the ELB control plane backlog. Because the ELB control plane currently manages requests for the US East-1 Region through a shared queue, it fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete.

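Although the direct impact was limited to ELBs that had failed in the datacenter that lost power and had not yet had their traffic shifted, the ELB service could not process new requests quickly. This delayed recovery for many customers who were launching new instances in other Availability Zones to replace lost EC2 capacity.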

While direct impact was limited to those ELBs which had failed in the power-affected datacenter and hadn’t yet had their traffic shifted, the ELB service’s inability to quickly process new requests delayed recovery for many customers who were replacing lost EC2 capacity by launching new instances in other Availability Zones.

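As a lesson from this, ELB processing is being split into multiple queues to improve overall throughput and to let time-sensitive actions such as traffic shifts be processed quickly. In addition, a backup DNS re-weighting mechanism will be developed that can very quickly shift traffic away from an affected Availability Zone without depending on the control plane.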

As a result of these impacts and our learning from them, we are breaking ELB processing into multiple queues to improve overall throughput and to allow more rapid processing of time-sensitive actions such as traffic shifts. We are also going to immediately develop a backup DNS re-weighting that can very quickly shift all ELB traffic away from an impacted Availability Zone without contacting the control plane.
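To see why a single shared queue hurts time-sensitive work, here is a rough sketch that compares one FIFO queue with a split into two queues. The request kinds, the function names, and the policy of draining the urgent queue first are assumptions made for illustration; the report says only that processing will be broken into multiple queues.

```python
from collections import deque


def requests_ahead_shared(pending, target):
    """Count requests processed before `target` when all share one FIFO queue."""
    queue = deque(pending)
    processed = 0
    while queue:
        if queue.popleft() == target:
            return processed
        processed += 1
    raise ValueError("target request is not in the queue")


def requests_ahead_split(pending, target, urgent_kinds):
    """Same count when urgent kinds get their own queue, drained first (assumed policy)."""
    urgent = [r for r in pending if r[0] in urgent_kinds]
    normal = [r for r in pending if r[0] not in urgent_kinds]
    return requests_ahead_shared(urgent + normal, target)


# 1,000 scaling requests (like those caused by the bug) arrive before one traffic shift.
backlog = [("scale", i) for i in range(1000)]
shift = ("traffic_shift", 0)
pending = backlog + [shift]

shared_wait = requests_ahead_shared(pending, shift)
split_wait = requests_ahead_split(pending, shift, {"traffic_shift"})
print(shared_wait, split_wait)  # prints "1000 0"
```

With one queue the traffic shift waits behind the entire backlog; with a dedicated queue it is handled immediately, which is the kind of effect the planned change is aiming for.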

Impact on Amazon RDS and countermeasures

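In Multi-AZ RDS, one of two database instances in separate Availability Zones is the primary and the other is the standby. The primary handles all database requests and replicates to the standby. If the primary fails, the standby is promoted to become the new primary.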

For Multi-AZ RDS, one of the two database instances is the “primary” and the other is a “standby.” The primary handles all database requests and replicates to the standby. In the case where a primary fails, the standby is promoted to be the new primary.

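Multi-AZ RDS instances detect failures and act immediately. If the primary fails, the DNS CNAME record is updated to point to the standby. If the standby fails, a new instance is launched, instantiated from the primary, and becomes the new standby. Once a failure is confirmed, failover can complete in under a minute.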

Multi-AZ RDS Instances detect failure in the primary or standby and immediately take action. If the primary fails, the DNS CNAME record is updated to point to the standby. If the standby fails, a new instance is launched and instantiated from the primary as the new standby. Once failure is confirmed, failover can take place in less than a minute.
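The two failure cases above can be sketched as a small state model. This is an illustration under stated assumptions, not the RDS implementation: the `MultiAZDatabase` class, its method names, and the instance names are hypothetical, and what happens to the standby slot after a promotion is not described in the report, so the sketch simply leaves it empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultiAZDatabase:
    cname_target: str        # instance the DNS CNAME record currently points to
    primary: str             # instance handling all database requests
    standby: Optional[str]   # replica in another Availability Zone, if any

    def handle_primary_failure(self) -> None:
        """Promote the standby and repoint the CNAME record at it."""
        if self.standby is None:
            raise RuntimeError("no standby available to promote")
        self.primary = self.standby
        self.cname_target = self.primary
        # Assumption: replacing the standby after promotion is out of scope here.
        self.standby = None

    def handle_standby_failure(self, new_instance: str) -> None:
        """Replace the failed standby with a newly launched instance."""
        self.standby = new_instance  # the primary and the CNAME stay unchanged


# Case 1: the primary fails, so clients follow the CNAME to the old standby.
db = MultiAZDatabase(cname_target="db-zone-a", primary="db-zone-a", standby="db-zone-b")
db.handle_primary_failure()

# Case 2: the standby fails, so only the standby slot changes.
other = MultiAZDatabase(cname_target="db-zone-a", primary="db-zone-a", standby="db-zone-b")
other.handle_standby_failure("db-zone-c")
print(db, other)
```

Since clients connect through the CNAME rather than to a specific instance, repointing the record is enough to move them to the promoted standby.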

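In the datacenter that lost power, many Single-AZ RDS instances became unavailable. They could not be recovered until the servers were powered up, booted, and brought online, and a large number of them were back online by 10pm PDT.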

When servers lost power in the impacted datacenter, many Single-AZ RDS instances in that Availability Zone became unavailable. There was no way to recover these instances until servers were powered up, booted, and brought online. By 10pm PDT, a large number of the affected Single-AZ RDS instances had been brought online.

When power was lost, most Multi-AZ RDS instances promoted their standby in a healthy Availability Zone to primary almost instantly, as expected.

At the point of power loss, most Multi-AZ instances almost instantly promoted their standby in a healthy AZ to “primary” as expected.

However, a small number of Multi-AZ RDS instances failed to complete failover. The cause was a software bug, introduced in April with a change to how storage failures are handled.

However, a small number of Multi-AZ RDS instances did not complete failover, due to a software bug. The bug was introduced in April when we made changes to the way we handle storage failure.

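The bug shows up only under a particular sequence of communication failures, and the variety of server shutdown sequences during this event produced exactly that situation. This triggered a failsafe, and manual intervention was needed to complete the failover.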

It is only manifested when a certain sequence of communication failure is experienced, situations we saw during this event as a variety of server shutdown sequences occurred. This triggered a failsafe which required manual intervention to complete the failover.

Most of the remaining Multi-AZ RDS failovers were completed by 11:00pm PDT.

The majority of remaining Multi-AZ failovers were completed by 11:00pm PDT.

To address the failover problems seen with some Multi-AZ RDS instances, a mitigation for this bug is being tested and will be rolled out to production in the coming weeks.

To address the issues we had with some Multi-AZ RDS Instances failovers, we have a mitigation for the bug in test and will be rolling it out in production in the coming weeks.