# Ben Maurer - Fail at Scale - ACM Queue (Highlights)

![rw-book-cover|256](https://readwise-assets.s3.amazonaws.com/static/images/article1.be68295a7e40.png)

## Metadata

**Cover**:: https://readwise-assets.s3.amazonaws.com/static/images/article1.be68295a7e40.png
**Source**:: #from/readwise
**Zettel**:: #zettel/fleeting
**Status**:: #x
**Authors**:: [[Ben Maurer]]
**Full Title**:: Fail at Scale - ACM Queue
**Category**:: #articles #readwise/articles
**Category Icon**:: 📰
**Document Tags**:: #sre
**URL**:: [queue.acm.org](https://queue.acm.org/detail.cfm?id=2839461)
**Host**:: [[queue.acm.org]]
**Highlighted**:: [[2021-02-12]]
**Created**:: [[2022-09-26]]

## Highlights

- Reliability in the face of rapid change.
- Failure is part of engineering any large-scale system.

### Why Do Failures Happen?

#### Individual machine failures

- The key to avoiding individual machine failure is automation.
- Automation works best by combining known failure patterns with a search for symptoms of an unknown problem (for example, by swapping out servers with unusually slow response times).
- When automation finds symptoms of an unknown problem, manual investigation can help develop better tools to detect and fix future problems.

#### Legitimate workload changes

- Sometimes Facebook users change their behavior in a way that poses challenges for our infrastructure.

#### Human error

- Three Easy Ways to Cause an Incident
- Rapidly deployed configuration changes
    - Make everybody use a common configuration system.
    - Statically validate configuration changes.
    - Run a canary.
    - Hold on to good configurations.
    - Make it easy to revert.
- Hard dependencies on core services
    - Cache data from core services.
    - Provide hardened APIs to use core services.
    - Run fire drills, ranging from fault injection applied to a single server to manually triggered outages of entire data centers.
- Increased latency and resource exhaustion

### Controlled Delay

- We experimented with a variant of the CoDel (controlled delay) algorithm (http://queue.acm.org/detail.cfm?id=2209336):

```
onNewRequest(req, queue):

  if (queue.lastEmptyTime() < (now - N seconds)) {
    timeout = M ms
  } else {
    timeout = N seconds
  }
  queue.enqueue(req, timeout)
```

- We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases.
- Facebook's open source Wangle library (https://github.com/facebook/wangle/blob/master/wangle/concurrent/Codel.cpp) provides an implementation of this algorithm, which is used by our Thrift framework (https://github.com/facebook/fbthrift).

### Adaptive LIFO (last-in, first-out)

- The first-in request has often been sitting around for so long that the user may have aborted the action that generated the request.
- During normal operating conditions, requests are processed in FIFO order, but when a queue is starting to form, the server switches to LIFO mode.
    - A sketch of the combined Controlled Delay + Adaptive LIFO queue appears after the Concurrency Control section below.

### Concurrency Control

- Some failures are so severe, however, that server-side controls are not able to kick in.
- Each client keeps track of the number of outstanding outbound requests on a per-service basis. When new requests are sent, if the number of outstanding requests to that service exceeds a configurable number, the request is immediately marked as an error.
    - Client-end throttle.
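The client-end throttle described above can be pictured as a per-service counter of in-flight requests that fails new requests fast once a cap is reached. Below is a minimal Python sketch under that reading; the names (`ConcurrencyLimiter`, `call_service`), the lock-based bookkeeping, and the cap of 100 outstanding requests are illustrative assumptions, not details from the article.

```python
import collections
import threading


class ConcurrencyLimiter:
    """Toy per-service cap on outstanding outbound requests.
    Illustrative sketch; names and the default cap are assumptions."""

    def __init__(self, max_outstanding_per_service=100):
        self.max_outstanding = max_outstanding_per_service
        self.outstanding = collections.Counter()   # service name -> in-flight count
        self.lock = threading.Lock()

    def try_acquire(self, service):
        """Reserve a slot before sending; False means mark the request as an error now."""
        with self.lock:
            if self.outstanding[service] >= self.max_outstanding:
                return False
            self.outstanding[service] += 1
            return True

    def release(self, service):
        """Call when the request completes, whether it succeeded, failed, or timed out."""
        with self.lock:
            self.outstanding[service] -= 1


# Usage sketch: wrap every outbound call to a downstream service.
limiter = ConcurrencyLimiter(max_outstanding_per_service=100)

def call_service(service, send_request):
    if not limiter.try_acquire(service):
        raise RuntimeError(f"{service}: too many outstanding requests, failing fast")
    try:
        return send_request()
    finally:
        limiter.release(service)
```

Rejecting at the client before a request is ever sent keeps an overloaded service from being handed work it can never finish, which is why this control can still help when the server-side mechanisms are unable to kick in.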
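Read together, the Controlled Delay and Adaptive LIFO sections above describe one server-side queue policy: bound queueing time aggressively once a standing queue forms, and serve the newest requests first while it persists. The sketch below is a toy Python rendering of that combination, not the Wangle/Thrift implementation; the class name, the reuse of the "has not drained in the last N seconds" test to trigger LIFO mode, and the drop-expired-requests-on-dequeue handling are assumptions made for illustration.

```python
import collections
import time

# Values the article reports working well: M = 5 ms, N = 100 ms.
M_SECONDS = 0.005
N_SECONDS = 0.100


class AdaptiveRequestQueue:
    """Toy request queue combining a CoDel-style queueing timeout with
    adaptive LIFO dequeueing. Illustrative sketch, not Facebook's code."""

    def __init__(self, m=M_SECONDS, n=N_SECONDS):
        self.m = m
        self.n = n
        self.items = collections.deque()          # (request, deadline) pairs
        self.last_empty_time = time.monotonic()   # last time the queue drained

    def _backed_up(self, now):
        # Same test as the article's pseudocode: the queue has not been
        # empty at any point during the last N seconds.
        return self.last_empty_time < now - self.n

    def on_new_request(self, request):
        now = time.monotonic()
        # CoDel-style timeout: short (M) while a standing queue persists,
        # generous (N) otherwise.
        timeout = self.m if self._backed_up(now) else self.n
        self.items.append((request, now + timeout))

    def get_next_request(self):
        """Return the next request to process, or None if nothing is runnable."""
        now = time.monotonic()
        if not self.items:
            self.last_empty_time = now
            return None
        # Adaptive LIFO (assumption: reuse the same "backed up" signal):
        # serve newest-first while a standing queue persists, since the
        # oldest requests are the most likely to have been abandoned.
        lifo = self._backed_up(now)
        while self.items:
            request, deadline = self.items.pop() if lifo else self.items.popleft()
            if not self.items:
                self.last_empty_time = now
            if now <= deadline:
                return request
            # Expired while queued: drop it; the server would answer with an error.
        return None
```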
### Tools that Help Diagnose Failures

#### High-Density Dashboards with Cubism

- Our dashboards grew so large that it was difficult to navigate them quickly.
- To address this, we built our top-level dashboards using Cubism (https://square.github.io/cubism/).

#### What just changed?

- We collect information about recent changes, ranging from configuration changes to deployments of new software, in a tool called OpsStream.

### Learning from Failure

- After failures happen, our incident-review process helps us learn from these incidents.
- Facebook has developed a methodology called DERP (for detection, escalation, remediation, and prevention) to aid in productive incident reviews.
- Detection. How was the issue detected?
- Escalation. Did the right people get involved quickly? Could these people have been brought in via alarms rather than manually?
- Remediation. What steps were taken to fix the issue? Can these steps be automated?
- Prevention. What improvements could remove the risk of this type of failure happening again?