Amazon cloud storage outage traced to a latent bug that caused a memory leak, triggered by a DNS change

October 29, 2012

Last Monday, October 22, a storage outage hit the US-East region of the Amazon cloud. In its report, "Summary of the October 22, 2012 AWS Service Event in the US-East Region", Amazon disclosed that the cause was a bug in the Amazon EBS storage servers that produced a memory leak.

Summary of the October 22, 2012 AWS Service Event in the US-East Region

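The memory leak on the Amazon EBS storage servers was set off by connection errors between servers, which in turn resulted from an internal DNS update that went wrong. Those errors awakened a latent bug, many servers were affected without anyone noticing, and at a certain point the damage surfaced all at once as a storage outage.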

What follows is an outline, based on Amazon's report, of how the outage unfolded.

A latent bug in the data collection agent wakes up

The outage began with degraded performance on a small number of EBS volumes within one Availability Zone.

At 10:00AM PDT Monday, a small number of Amazon Elastic Block Store (EBS) volumes in one of our five Availability Zones in the US-East Region began seeing degraded performance, and in some cases, became “stuck” (i.e. unable to process further I/O requests).

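In short, at 10:00 AM Pacific time on Monday, October 22, a small number of Amazon EBS volumes in one of the region's five Availability Zones slowed down, and some of them became stuck, unable to process any further I/O requests.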

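Each EBS storage server runs a data collection agent that connects over the network to the data collection servers and reports to them. The outage stemmed from a bug in this agent, and what triggered the bug was a DNS update that went wrong.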

Last week, one of the data collection servers in the affected Availability Zone had a hardware failure and was replaced. As part of replacing that server, a DNS record was updated to remove the failed server and add the replacement server.

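That is, in the week before the outage, one data collection server in the affected Availability Zone failed and was replaced, and the DNS record was updated to drop the failed server and add its replacement.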

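For some reason, however, this DNS update did not go through cleanly, and some EBS storage servers never picked up the new DNS information. Those servers, still holding stale DNS data, kept trying and failing to connect to the server that had been removed after it failed. That became the trigger for the bug.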

Because of the design of the data collection service (which is tolerant to missing data), this did not cause any immediate issues or set off any alarms. However, this inability to contact a data collection server triggered a latent memory leak bug in the reporting agent on the storage servers.

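In other words, because the data collection service is designed to tolerate missing data, the failed connections caused no immediate problem and raised no alarm. They did, however, trigger a latent memory leak bug in the reporting agent on the EBS storage servers.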
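The report does not describe the mechanics of the leak, so the following is only a hypothetical illustration of how an agent that "tolerates" an unreachable collector can still leak memory. The `ReportingAgent` class, its `send` callback, and the `max_pending` parameter are all assumptions made for this sketch, not details of the EBS agent.

```python
from collections import deque


class ReportingAgent:
    """Hypothetical agent that queues metrics while its collector is unreachable."""

    def __init__(self, send, max_pending=None):
        # send: callable that raises ConnectionError when the collector is down.
        # max_pending=None gives an unbounded queue (the leaky behaviour);
        # an integer caps the queue, and the oldest reports are dropped.
        self._send = send
        self._pending = deque(maxlen=max_pending)

    def report(self, metric):
        self._pending.append(metric)
        try:
            while self._pending:
                self._send(self._pending[0])
                self._pending.popleft()
        except ConnectionError:
            # Collector unreachable: keep what is queued and retry on the next call.
            pass

    @property
    def pending(self):
        return len(self._pending)


def unreachable(_metric):
    # Stands in for a collector that was removed while stale DNS still points at it.
    raise ConnectionError("collector is gone")


leaky = ReportingAgent(unreachable)
bounded = ReportingAgent(unreachable, max_pending=100)
for i in range(10_000):
    leaky.report(i)
    bounded.report(i)

print(leaky.pending, bounded.pending)  # 10000 100
```

Neither agent raises an error or an alarm, which matches the "tolerant to missing data" design, but only the bounded one keeps its memory use flat while the collector stays unreachable.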

The memory pressure caused by the leak was also hard to detect, according to the report.

EBS Servers generally make very dynamic use of all of their available memory for managing customer data, making it difficult to set accurate alarms on memory usage and free memory.

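Put simply, EBS servers normally use all of their available memory, very dynamically, to manage customer data, which makes it difficult to set accurate alarms on memory usage and free memory.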

As a result, the memory leak went unnoticed, as did the fact that the EBS storage servers were creeping toward a stuck state, and that led to Monday's outage.

The outage hits, and failover targets run short

Naturally, the Amazon cloud tries to handle such a failure by isolating the failed servers and failing over to healthy ones.

The memory pressure on many of the EBS servers had reached a point where EBS servers began losing the ability to process customer requests and the number of stuck volumes increased quickly. This caused the system to begin to failover from the degraded servers to healthy servers.

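That is, memory pressure on many EBS servers reached the point where they could no longer process customer requests, and the number of stuck volumes climbed quickly. The system then began failing over from the degraded servers to healthy ones.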

But as more and more servers became stuck, failover targets ran short.

However, because many of the servers became memory-exhausted at the same time, the system was unable to find enough healthy servers to failover to, and more volumes became stuck. By approximately 11:00AM PDT, a large number of volumes in this Availability Zone were stuck.

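Because many servers ran out of memory at the same time, the system could not find enough healthy servers to fail over to, and still more volumes became stuck. By around 11:00 AM, a large number of volumes in the Availability Zone were stuck.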
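The capacity problem can be seen in a toy model. The sketch below is not Amazon's failover logic. It assumes a hypothetical fleet in which every server has the same volume capacity, and simply counts how many volumes find a healthy target and how many stay stuck.

```python
def fail_over(volumes_by_server, degraded, capacity):
    """Move volumes off degraded servers onto healthy ones with spare capacity.

    volumes_by_server: dict of server name -> number of volumes hosted
    degraded: set of server names that can no longer serve I/O
    capacity: maximum volumes a single server can host (assumed uniform)
    Returns (moved, stuck).
    """
    spare = {
        server: capacity - count
        for server, count in volumes_by_server.items()
        if server not in degraded
    }
    moved = stuck = 0
    for server in sorted(degraded):
        for _ in range(volumes_by_server[server]):
            # Pick the healthy server with the most room left, if any exists.
            target = max(spare, key=spare.get, default=None)
            if target is None or spare[target] == 0:
                stuck += 1
                continue
            spare[target] -= 1
            moved += 1
    return moved, stuck


fleet = {f"ebs-{i}": 8 for i in range(10)}  # 10 servers, 8 volumes each

# One degraded server: the other nine absorb its volumes.
print(fail_over(fleet, {"ebs-0"}, capacity=10))  # (8, 0)

# Six degraded at once: four healthy servers have room for only 8 volumes.
print(fail_over(fleet, {f"ebs-{i}" for i in range(6)}, capacity=10))  # (8, 40)
```

With these made-up numbers, a single failure is absorbed completely, while six simultaneous failures leave 40 of 48 volumes with nowhere to go.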

The team responded with measures such as adjusting the failover rate. The system as a whole gradually began to recover, but identifying the root cause took more time.

At 3:10PM PDT, the team identified the underlying issue and was able to begin restoring performance for the remaining volumes by freeing the excess memory consumed by the misbehaving collection agent. At this point, the system was able to recover most of the remaining stuck volumes; and by 4:15PM PDT, nearly all affected volumes were restored and performing normally.

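At 3:10 PM the team identified the underlying issue and, by freeing the excess memory consumed by the data collection agent, was able to start restoring performance on the remaining volumes. Most of the remaining stuck volumes could then be recovered, and by 4:15 PM nearly all affected volumes were back to normal performance.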

Countermeasures: detecting the memory leak and fixing the bug

Amazon lists the following countermeasures in response to the outage.

We have deployed monitoring that will alarm if we see this specific memory leak again in any of our production EBS servers, and next week, we will begin deploying a fix for the memory leak issue. We are also modifying our system memory monitoring on the EBS storage servers to monitor and alarm on each process’s memory consumption, and we will be deploying resource limits to prevent low priority processes from consuming excess resources on these hosts.

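Amazon has deployed monitoring that raises an alarm if this specific memory leak recurs on any production EBS server, and plans to begin deploying a fix for the leak the following week. It is also changing system memory monitoring on the EBS storage servers to track and alarm on each process's memory consumption, and will deploy resource limits so that low-priority processes cannot consume excess resources on these hosts.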
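Per-process monitoring helps here because a host-wide "free memory" figure fluctuates too much to alarm on, whereas a small agent whose memory only ever grows stands out. The following is a minimal sketch of that idea, not Amazon's monitoring. The function name, the sample format, and the thresholds are all assumptions.

```python
def find_leak_suspects(samples, min_growth_bytes, min_samples=4):
    """Return names of processes whose memory grew in every sample interval.

    samples: dict of process name -> list of resident-memory readings in bytes,
             oldest first, taken at regular intervals.
    """
    suspects = []
    for name, readings in samples.items():
        if len(readings) < min_samples:
            continue
        steadily_rising = all(b > a for a, b in zip(readings, readings[1:]))
        growth = readings[-1] - readings[0]
        if steadily_rising and growth >= min_growth_bytes:
            suspects.append(name)
    return sorted(suspects)


MB = 1024 * 1024
samples = {
    # Large but fluctuating: normal for a process that caches customer data.
    "storage-engine": [900 * MB, 400 * MB, 950 * MB, 600 * MB],
    # Small but monotonically growing: the pattern a leak produces.
    "reporting-agent": [20 * MB, 35 * MB, 52 * MB, 70 * MB],
}

print(find_leak_suspects(samples, min_growth_bytes=32 * MB))  # ['reporting-agent']
```

The process names and readings are invented for the example. A real check would also need to handle restarts and noisy samples.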

Amazon says it will also address DNS, which triggered the outage.

We are also updating our internal DNS configuration to further ensure that DNS changes are propagated reliably, and as importantly, make sure that our monitoring and alarming surface issues more quickly should these changes not succeed.

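Amazon is also updating its internal DNS configuration to further ensure that DNS changes propagate reliably and, just as importantly, that monitoring and alarms surface the problem more quickly when such a change does not succeed.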
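One way to surface a failed DNS change quickly is to compare what each internal resolver returns with what the change was supposed to produce. The sketch below shows only that comparison step. It assumes the answers have already been collected by some other means, and the resolver names and addresses are placeholders.

```python
def stale_resolvers(expected, answers_by_resolver):
    """Return resolvers whose answer set differs from the expected records.

    expected: iterable of addresses the record should resolve to after the change
    answers_by_resolver: dict of resolver name -> iterable of addresses it returned
    """
    want = frozenset(expected)
    return sorted(
        name
        for name, got in answers_by_resolver.items()
        if frozenset(got) != want
    )


# Placeholder data: 192.0.2.10 is the failed collector, 192.0.2.30 its replacement.
expected = ["192.0.2.20", "192.0.2.30"]
answers = {
    "dns-a": ["192.0.2.30", "192.0.2.20"],  # updated (order does not matter)
    "dns-b": ["192.0.2.10", "192.0.2.20"],  # still serving the removed server
    "dns-c": ["192.0.2.20", "192.0.2.30"],  # updated
}

print(stale_resolvers(expected, answers))  # ['dns-b']
```

Any non-empty result after a change would be the signal to alarm on, instead of waiting for clients to fail silently.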

In addition, Amazon says it will change the EBS failover logic so that failover does not cause a rapid deterioration in performance.