Let's walk through the key points of "Summary of the AWS Service Event in the US East Region," the report in which Amazon detailed this outage.


Impact on Amazon ELB and Countermeasures



ELBs can also be deployed in multiple Availability Zones. In this configuration, each Availability Zone’s end-point will have a separate IP address. A single Domain Name will point to all of the end-points’ IP addresses.


For multi-Availability Zone ELBs, the ELB service maintains ELBs redundantly in the Availability Zones a customer requests them to be in so that failure of a single machine or datacenter won’t take down the end-point.


The ELB service avoids impact (even for clients which can only process a single IP address) by detecting failure and eliminating the problematic ELB instance’s IP address from the list returned by DNS. The ELB control plane processes all management events for ELBs including traffic shifts due to failure, size scaling for ELB due to traffic growth, and addition and removal of EC2 instances from association with a given ELB.
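The DNS-based failure masking described above can be sketched in a few lines. This is a toy model, not the actual ELB service: the zone names, IP addresses, and health map are all invented for illustration.

```python
import random

# Hypothetical mapping: one ELB endpoint IP per Availability Zone.
# Addresses are from the documentation range 192.0.2.0/24, not real AWS IPs.
endpoint_ips = {
    "us-east-1a": "192.0.2.10",
    "us-east-1b": "192.0.2.20",
    "us-east-1c": "192.0.2.30",
}
# Pretend health checks have detected a failure in us-east-1b.
healthy = {"us-east-1a": True, "us-east-1b": False, "us-east-1c": True}

def dns_answer():
    """Return only the IPs of healthy ELB instances, as the DNS layer
    described above would: failed endpoints are simply omitted."""
    return [ip for az, ip in endpoint_ips.items() if healthy[az]]

def pick_ip():
    """Even a client that can process only a single IP address avoids the
    failed zone, because the removed address never appears in the answer."""
    return random.choice(dns_answer())

print(dns_answer())  # the us-east-1b address is absent
```

Because the filtering happens at resolution time, no client-side failover logic is required; the trade-off is that recovery speed is bounded by DNS TTLs and the health-check interval.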



As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before. The bug caused the ELB control plane to attempt to scale these ELBs to larger ELB instance sizes.


This resulted in a sudden flood of requests which began to backlog the control plane. At the same time, customers began launching new EC2 instances to replace capacity lost in the impacted Availability Zone, requesting the instances be added to existing load balancers in the other zones.


These requests further increased the ELB control plane backlog. Because the ELB control plane currently manages requests for the US East-1 Region through a shared queue, it fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete.


While direct impact was limited to those ELBs which had failed in the power-affected datacenter and hadn’t yet had their traffic shifted, the ELB service’s inability to quickly process new requests delayed recovery for many customers who were replacing lost EC2 capacity by launching new instances in other Availability Zones.


As a result of these impacts and our learning from them, we are breaking ELB processing into multiple queues to improve overall throughput and to allow more rapid processing of time-sensitive actions such as traffic shifts. We are also going to immediately develop a backup DNS re-weighting that can very quickly shift all ELB traffic away from an impacted Availability Zone without contacting the control plane.
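The queue split described above can be illustrated with a minimal sketch. The action names and priority order are assumptions for illustration, not the actual AWS control-plane design: the point is only that time-sensitive work no longer waits behind a backlog of bulk requests in one shared FIFO.

```python
from collections import deque

# One queue per request class instead of a single shared queue.
# Listed in drain-priority order; names are illustrative.
queues = {
    "traffic_shift": deque(),          # time-sensitive: drain first
    "scaling": deque(),
    "instance_registration": deque(),
}

def submit(action, payload):
    queues[action].append(payload)

def next_request():
    """Drain queues in priority order, so a flood of scaling requests
    can no longer delay a traffic shift away from a failed zone."""
    for action in ("traffic_shift", "scaling", "instance_registration"):
        if queues[action]:
            return action, queues[action].popleft()
    return None

# A backlog of scaling work arrives first...
for i in range(3):
    submit("scaling", f"resize-elb-{i}")
# ...but a later traffic shift is still processed immediately.
submit("traffic_shift", "shift-away-from-us-east-1a")
assert next_request() == ("traffic_shift", "shift-away-from-us-east-1a")
```

With the original single shared queue, that traffic shift would have sat behind all three scaling requests, which is exactly the delay pattern described in the report.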

Impact on Amazon RDS and Countermeasures


For Multi-AZ RDS, one of the two database instances is the “primary” and the other is a “standby.” The primary handles all database requests and replicates to the standby. In the case where a primary fails, the standby is promoted to be the new primary.

Multi-AZ RDS Instances detect failure in the primary or standby and immediately take action. If the primary fails, the DNS CNAME record is updated to point to the standby. If the standby fails, a new instance is launched and instantiated from the primary as the new standby. Once failure is confirmed, failover can take place in less than a minute.
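The failover sequence quoted above can be sketched as a toy model. The DNS table, host names, and instance-launch placeholder are all invented for illustration; real RDS failover involves replication state checks that are omitted here.

```python
# Simulated DNS: clients connect via a CNAME, never a host IP directly.
dns = {"mydb.example.com": "primary-host-1a"}
primary, standby = "primary-host-1a", "standby-host-1b"

def launch_new_standby():
    """Placeholder for launching and replicating a fresh standby
    from the new primary."""
    return "standby-host-new"

def fail_primary():
    """On confirmed primary failure: repoint the CNAME to the standby,
    promote it, then launch a replacement standby."""
    global primary, standby
    dns["mydb.example.com"] = standby   # clients follow the CNAME
    primary = standby
    standby = launch_new_standby()

fail_primary()
print(dns["mydb.example.com"])  # standby-host-1b is now the primary
```

Because applications resolve the CNAME rather than caching a host address, the promotion is transparent to clients once DNS propagates, which is how the sub-minute failover described above stays application-independent.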


When servers lost power in the impacted datacenter, many Single-AZ RDS instances in that Availability Zone became unavailable. There was no way to recover these instances until servers were powered up, booted, and brought online. By 10pm PDT, a large number of the affected Single-AZ RDS instances had been brought online.


At the point of power loss, most Multi-AZ instances almost instantly promoted their standby in a healthy AZ to “primary” as expected.

However, a small number of Multi-AZ RDS instances did not complete failover, due to a software bug. The bug was introduced in April when we made changes to the way we handle storage failure.


The bug only manifests when a certain sequence of communication failures is experienced, a situation we saw during this event as a variety of server shutdown sequences occurred. This triggered a failsafe which required manual intervention to complete the failover.


The majority of remaining Multi-AZ failovers were completed by 11:00pm PDT.

To address the issues we had with some Multi-AZ RDS Instances failovers, we have a mitigation for the bug in test and will be rolling it out in production in the coming weeks.


Blogger in Chief

Junichi Niino (jniino)