Christmas Eve 2012: A System Outage Rains Down from the Amazon Cloud, Caused by an Operational Mistake

January 8, 2013

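On December 24, 2012, people in North America who had planned to spend Christmas Eve watching movies online with Netflix were in for a disappointment. An outage in the Amazon cloud disrupted both the Netflix website and video playback.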

Summary of the December 24, 2012 Amazon ELB Service Event in the US-East Region

The outage occurred in the Amazon cloud's US-East region. Amazon has published a detailed report on how it unfolded and what caused it.

Data deleted by a mistake during maintenance

The Christmas Eve outage was caused by an operational mistake in a relatively high-level service, the load balancer. Let's walk through the key points.

The service disruption began at 12:24 PM PST on December 24th when a portion of the ELB state data was logically deleted. This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example tracking all the backend hosts to which traffic should be routed by each load balancer). The data was deleted by a maintenance process that was inadvertently run against the production ELB state data.

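To restate: the disruption began at 12:24 PM Pacific time on December 24, when part of the Elastic Load Balancing (ELB) state data was logically deleted. The ELB control plane uses and maintains this data to manage the configuration of the ELB load balancers in the region, for example to track which backend hosts each load balancer should route traffic to.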

The data was deleted by a maintenance process that was run, by mistake, against the production ELB state data.

The deletion went unnoticed at first. Only a small number of ELBs were affected, but for those, latency and errors kept growing.

For the next several hours, the technical team was reportedly puzzled as to why some ELBs were producing errors while the large majority kept working without problems.

Then configuration changes to ELBs began to trigger a gradually widening problem.

As this continued, some customers began to experience performance issues with their running load balancers. These issues only occurred after the ELB control plane attempted to make changes to a running load balancer. When a user modifies a load balancer configuration or a load balancer needs to scale up or down, the ELB control plane makes changes to the load balancer configuration.

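As this went on, some customers began to see performance problems on load balancers that were already running. The problems appeared only after the ELB control plane tried to change a running load balancer. The control plane makes such changes whenever a user modifies a load balancer's configuration or a load balancer needs to scale up or down.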

During this event, because the ELB control plane lacked some of the necessary ELB state data to successfully make these changes, load balancers that were modified were improperly configured by the control plane.

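Because the ELB control plane was missing some of the state data it needed to make these changes correctly, any load balancer it modified ended up improperly configured.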
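
The failure mode described here can be illustrated with a small sketch. Everything in it is hypothetical (the `LoadBalancer` class, the `apply_change` function, and the dictionary standing in for the ELB state data); the AWS report does not describe the control plane's internals. It only shows how rebuilding a configuration from incomplete state can break a load balancer that was running fine, and how a strict check could refuse the change instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LoadBalancer:
    """Hypothetical stand-in for a running load balancer."""
    name: str
    backends: List[str] = field(default_factory=list)


def apply_change(lb: LoadBalancer, state: Dict[str, List[str]], *, strict: bool) -> LoadBalancer:
    """Rebuild the load balancer's backend list from the state store.

    With strict=False, a missing record is treated as "no backends",
    which silently misconfigures a load balancer that was healthy.
    With strict=True, the change is refused and the running
    configuration is left untouched.
    """
    record = state.get(lb.name)
    if record is None:
        if strict:
            raise LookupError(f"no state record for {lb.name}; refusing to reconfigure")
        record = []  # assumption: the naive path falls back to an empty list
    lb.backends = list(record)
    return lb


if __name__ == "__main__":
    # The record for "lb-b" has been deleted from the state data.
    state = {"lb-a": ["10.0.0.1", "10.0.0.2"]}
    lb_b = LoadBalancer("lb-b", ["10.0.1.1"])
    apply_change(lb_b, state, strict=False)
    print(lb_b)  # the running load balancer has lost its backends
```

Note that a load balancer nobody touches keeps its old, correct configuration in this sketch, which matches the report's observation that problems appeared only after the control plane attempted a change.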

At this point the technical team also recognized that something was wrong and began investigating the cause in earnest.

It was when the ELB technical team started digging deeply into these degraded load balancers that the team identified the missing ELB state data as the root cause of the service disruption.

It was only when the ELB technical team dug deeply into the degraded load balancers that it identified the missing ELB state data as the root cause.

ELB workflows halted for recovery

Once the cause was identified, the next steps were to restore the lost data and to stop the damage from spreading any further.

At 5:02 PM PST, the team disabled several of the ELB control plane workflows (including the scaling and descaling workflows) to prevent additional running load balancers from being affected by the missing ELB state data. At the peak of the event, 6.8% of running ELB load balancers were impacted. The rest of the load balancers in the system were unable to scale or be modified by customers, but were operating correctly.

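At 5:02 PM, the team disabled several ELB control plane workflows, including those for scaling up and down, so that no more running load balancers would be affected by the missing state data. At the peak of the event, 6.8% of running ELB load balancers were impacted. The remaining load balancers could not scale or be modified by customers, but they were operating correctly.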
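
Disabling workflows to contain the damage is essentially a kill switch. The following is a minimal sketch under stated assumptions: the `ControlPlane` class, the workflow names, and the `capacity` dictionary are all invented for illustration and are not taken from the AWS report. It shows the trade-off the report describes: while the switch is off, nothing can scale or be modified, but nothing already running is changed either.

```python
class WorkflowDisabled(RuntimeError):
    """Raised when a request hits a workflow that has been switched off."""


class ControlPlane:
    """Hypothetical control plane with per-workflow kill switches."""

    def __init__(self):
        self.disabled = set()   # names of workflows currently switched off
        self.capacity = {}      # load balancer name -> number of nodes

    def disable(self, *workflows):
        self.disabled.update(workflows)

    def enable(self, *workflows):
        self.disabled.difference_update(workflows)

    def run(self, workflow, lb_name, delta):
        """Apply a capacity change unless the workflow is disabled."""
        if workflow in self.disabled:
            raise WorkflowDisabled(f"{workflow} is disabled; {lb_name} left unchanged")
        self.capacity[lb_name] = self.capacity.get(lb_name, 0) + delta
        return self.capacity[lb_name]


if __name__ == "__main__":
    cp = ControlPlane()
    cp.capacity["lb-a"] = 2
    cp.disable("scale_up", "scale_down")
    try:
        cp.run("scale_up", "lb-a", 1)
    except WorkflowDisabled as exc:
        print(exc)
    print(cp.capacity)  # unchanged while the workflows are off
```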

From that evening through the middle of the night, the operations team worked manually to recover the lost state data.

At 2:45 AM PST on December 25th, the team successfully restored a snapshot of the ELB state data to a time just before the data was deleted. The team then began merging this restored data with the system state changes that happened between this snapshot and the current time. By 5:40 AM PST, this data merge had been completed and the new ELB state data had been verified. The team then began slowly re-enabling the ELB service workflows and APIs.

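At 2:45 AM on December 25, the team succeeded in restoring a snapshot of the state data taken just before the deletion. It then began merging the restored data with the system state changes that had occurred between the snapshot and the present. By 5:40 AM the merge was complete and the new ELB state data had been verified. After that, the team slowly re-enabled the ELB service workflows and APIs.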
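
The report does not say how the merge was carried out. One common way to think about "snapshot plus the changes since then" is to replay a change log on top of the restored snapshot. The sketch below assumes exactly that; the `restore_and_merge` function, the tuple layout of the log, and the integer timestamps are hypothetical.

```python
import copy


def restore_and_merge(snapshot, snapshot_time, change_log):
    """Rebuild current state from a snapshot plus later changes.

    snapshot:      dict mapping load balancer name -> list of backends
    snapshot_time: timestamp at which the snapshot was taken
    change_log:    iterable of (timestamp, lb_name, backends) tuples,
                   where backends=None means the load balancer was deleted
    """
    state = copy.deepcopy(snapshot)  # never mutate the restored snapshot itself
    for ts, lb_name, backends in sorted(change_log, key=lambda change: change[0]):
        if ts <= snapshot_time:
            continue  # already reflected in the snapshot
        if backends is None:
            state.pop(lb_name, None)
        else:
            state[lb_name] = list(backends)
    return state


if __name__ == "__main__":
    snapshot = {"lb-a": ["10.0.0.1"], "lb-b": ["10.0.1.1"]}
    log = [
        (90, "lb-a", ["10.0.0.9"]),               # older than the snapshot, ignored
        (120, "lb-c", ["10.0.2.1"]),              # created after the snapshot
        (110, "lb-a", ["10.0.0.1", "10.0.0.2"]),  # backend added after the snapshot
        (130, "lb-b", None),                      # deleted after the snapshot
    ]
    print(restore_and_merge(snapshot, 100, log))
```

Keeping the snapshot untouched and producing a new merged state makes it possible to verify the result before switching over, which is the order of steps the report describes.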

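By 10:30 AM, the load balancers, including those affected by the outage, were back to full operation, and by 12:05 PM the team had confirmed that the service had returned to normal.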
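
To get a feel for how long each phase took, the timestamps quoted above can be lined up against the 12:24 PM start. This is just arithmetic on the times given in the report (all PST, so naive datetimes are sufficient); the labels and the `elapsed` helper are my own.

```python
from datetime import datetime

START = datetime(2012, 12, 24, 12, 24)  # state data deleted

MILESTONES = [
    ("control plane workflows disabled", datetime(2012, 12, 24, 17, 2)),
    ("snapshot restored", datetime(2012, 12, 25, 2, 45)),
    ("merge completed and verified", datetime(2012, 12, 25, 5, 40)),
    ("load balancers back to full operation", datetime(2012, 12, 25, 10, 30)),
    ("service confirmed normal", datetime(2012, 12, 25, 12, 5)),
]


def elapsed(since, until):
    """Format the time between two datetimes as e.g. '4h38m'."""
    minutes = int((until - since).total_seconds() // 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


for label, when in MILESTONES:
    print(f"{elapsed(START, when):>7}  {label}")
```

By this count, it took 4 hours 38 minutes from the deletion until the workflows were disabled, and 23 hours 41 minutes until the service was confirmed to be back to normal.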

Lessons and next steps

In response to this ELB outage, Amazon says it is taking the following measures.

First, we have modified the access controls on our production ELB state data to prevent inadvertent modification without specific Change Management (CM) approval.

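First, Amazon has changed the access controls on the production ELB state data so that it cannot be modified inadvertently; changes now require specific Change Management (CM) approval.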
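
The report does not explain how the new access control is implemented. As a rough illustration of the idea only, the sketch below wraps a state store so that destructive operations fail unless they carry an approved change ID. The `GuardedStateStore` class, the `change_id` parameter, and the ID format are all hypothetical.

```python
class ApprovalRequired(PermissionError):
    """Raised when a modification lacks an approved change ID."""


class GuardedStateStore:
    """Hypothetical state store that only accepts approved modifications."""

    def __init__(self, data, approved_changes):
        self._data = dict(data)
        self._approved = set(approved_changes)  # change IDs cleared by change management

    def get(self, key):
        # Reads are always allowed.
        return self._data.get(key)

    def delete(self, key, *, change_id=None):
        # Deletes require an approved change ID; the default (None) is never approved.
        if change_id not in self._approved:
            raise ApprovalRequired(f"deleting {key!r} requires an approved change ID")
        self._data.pop(key, None)


if __name__ == "__main__":
    store = GuardedStateStore({"lb-a": ["10.0.0.1"]}, approved_changes={"CM-1234"})
    try:
        store.delete("lb-a")  # a maintenance script run against production by mistake
    except ApprovalRequired as exc:
        print(exc)
    print(store.get("lb-a"))  # the record is still there
```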

Amazon also says it is improving its data recovery.

We have also modified our data recovery process to reflect the learning we went through in this event. We are confident that we could recover ELB state data in a similar event significantly faster (if necessary) for any future operational event.

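The data recovery process has also been changed to reflect what was learned from this event. Amazon says it is confident that, should a similar event occur, it could recover the ELB state data significantly faster.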