# Ben Maurer - Fail at Scale - ACM Queue (Highlights)

## Metadata
**Cover**:: https://readwise-assets.s3.amazonaws.com/static/images/article1.be68295a7e40.png
**Source**:: #from/readwise
**Zettel**:: #zettel/fleeting
**Status**:: #x
**Authors**:: [[Ben Maurer]]
**Full Title**:: Fail at Scale - ACM Queue
**Category**:: #articles #readwise/articles
**Category Icon**:: 📰
**Document Tags**:: #sre
**URL**:: [queue.acm.org](https://queue.acm.org/detail.cfm?id=2839461)
**Host**:: [[queue.acm.org]]
**Highlighted**:: [[2021-02-12]]
**Created**:: [[2022-09-26]]
## Highlights
- Reliability in the face of rapid change
- Failure is part of engineering any large-scale system.
### Why Do Failures Happen?
#### Individual machine failures
- The key to avoiding individual machine failure is automation.
- Automation works best by combining known failure patterns with a search for symptoms of an unknown problem, for example by swapping out servers with unusually slow response times (see the sketch below).
- When automation finds symptoms of an unknown problem, manual investigation can help develop better tools to detect and fix future problems.
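- A minimal sketch of that symptom search, flagging hosts whose response times are statistical outliers relative to the rest of the fleet; the metric, threshold, and function names are illustrative assumptions, not Facebook's actual tooling.
```python
from statistics import mean, stdev

def find_slow_servers(latency_ms_by_host, z_threshold=3.0):
    """Flag hosts whose recent mean response time is an outlier vs. the fleet.

    latency_ms_by_host: dict of host -> recent mean latency in ms.
    z_threshold: how many standard deviations above the fleet mean counts as
                 "unusually slow" (illustrative value).
    """
    values = list(latency_ms_by_host.values())
    if len(values) < 3:
        return []  # not enough data to call anything an outlier
    fleet_mean, fleet_std = mean(values), stdev(values)
    if fleet_std == 0:
        return []
    return [host for host, latency in latency_ms_by_host.items()
            if (latency - fleet_mean) / fleet_std > z_threshold]

# A flagged host would be drained and investigated manually; the findings can
# then be turned into a new known-failure pattern for the automation.
```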
#### Legitimate workload changes
- Sometimes Facebook users change their behavior in a way that poses challenges for our infrastructure.
#### Human error
- Three Easy Ways to Cause an Incident
- Rapidly deployed configuration changes. Mitigations (a config-validation sketch follows this list):
    - Make everybody use a common configuration system.
    - Statically validate configuration changes.
    - Run a canary.
    - Hold on to good configurations.
    - Make it easy to revert.
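- A hedged illustration of static configuration validation; the required keys and bounds are hypothetical examples of the kinds of invariants such a validator could enforce before a change is deployed.
```python
def validate_config(cfg):
    """Reject obviously bad configs before they reach production.

    The required keys and numeric bounds below are hypothetical examples,
    not Facebook's actual configuration schema.
    """
    errors = []
    for key in ("service_name", "request_timeout_ms", "max_connections"):
        if key not in cfg:
            errors.append(f"missing required key: {key}")
    timeout = cfg.get("request_timeout_ms")
    if isinstance(timeout, int) and not (1 <= timeout <= 60_000):
        errors.append("request_timeout_ms out of range (1..60000)")
    return errors

# Example: a canary rollout would only proceed if the validator returns no errors.
print(validate_config({"service_name": "feed", "request_timeout_ms": 0}))
```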
- Hard dependencies on core services. Mitigations (a caching sketch follows this list):
    - Cache data from core services.
    - Provide hardened APIs to use core services.
    - Run fire drills, ranging from fault injection applied to a single server to manually triggered outages of entire data centers.
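- A small sketch of the "cache data from core services" mitigation: serve stale cached data when the core service is unavailable instead of failing outright. The class name, TTL, and fallback policy are assumptions for illustration.
```python
import time

class CachedLookup:
    """Wrap a call to a core service with a cache that can serve stale data."""

    def __init__(self, fetch, ttl_seconds=30):
        self._fetch = fetch          # function: key -> value, may raise on failure
        self._ttl = ttl_seconds
        self._cache = {}             # key -> (value, fetched_at)

    def get(self, key):
        entry = self._cache.get(key)
        now = time.time()
        if entry and now - entry[1] < self._ttl:
            return entry[0]          # fresh cache hit
        try:
            value = self._fetch(key)
            self._cache[key] = (value, now)
            return value
        except Exception:
            if entry:
                return entry[0]      # core service failed: fall back to stale data
            raise                    # no cached value to fall back on
```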
- Increased latency and resource exhaustion. The techniques below (controlled delay, adaptive LIFO, concurrency control) mitigate this failure mode.
### Controlled Delay
- We experimented with a variant of the CoDel (controlled delay) algorithm [1]:
1. CoDel (controlled delay) algorithm; http://queue.acm.org/detail.cfm?id=2209336.
```
onNewRequest(req, queue):
  // If the queue has not been empty within the last N seconds, the server is
  // not keeping up: give new requests only a short M-millisecond timeout.
  if (queue.lastEmptyTime() < (now - N seconds)) {
    timeout = M ms
  } else {
    timeout = N seconds
  }
  queue.enqueue(req, timeout)
```
- We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases.
- Facebook's open-source Wangle library [5] provides an implementation of this algorithm, which is used by our Thrift [4] framework.
4. Thrift framework; https://github.com/facebook/fbthrift.
5. Wangle library; https://github.com/facebook/wangle/blob/master/wangle/concurrent/Codel.cpp.
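- The pseudocode above can be made concrete; this is a minimal Python sketch of the same queue-timeout idea (not the Wangle C++ implementation), using the suggested M = 5 ms and N = 100 ms.
```python
import time
from collections import deque

M_SECONDS = 0.005   # short timeout used while the queue cannot fully drain
N_SECONDS = 0.100   # normal timeout, and the "has the queue emptied?" window

class CoDelQueue:
    def __init__(self):
        self._queue = deque()                 # items: (request, deadline)
        self._last_empty = time.monotonic()

    def on_new_request(self, req):
        now = time.monotonic()
        if not self._queue:
            self._last_empty = now
        # If the queue has not been empty within the last N seconds, only let
        # new requests wait in the queue for M seconds.
        if self._last_empty < now - N_SECONDS:
            timeout = M_SECONDS
        else:
            timeout = N_SECONDS
        self._queue.append((req, now + timeout))

    def next_request(self):
        # Drop requests whose queue timeout has expired, then serve one.
        now = time.monotonic()
        while self._queue:
            req, deadline = self._queue.popleft()
            if deadline >= now:
                if not self._queue:
                    self._last_empty = now
                return req
        self._last_empty = now
        return None
```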
### Adaptive LIFO (last-in, first-out)
- Once a queue has built up, the first-in request has often been sitting around for so long that the user may have aborted the action that generated it.
- During normal operating conditions, requests are processed in FIFO order, but when a queue is starting to form, the server switches to LIFO mode.
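- A hedged sketch of adaptive LIFO: serve FIFO while the queue stays short, and switch to serving the newest request first once a backlog forms. The switch-over threshold is an illustrative assumption; the article notes that adaptive LIFO and CoDel work well together.
```python
from collections import deque

LIFO_THRESHOLD = 100   # illustrative queue depth at which we switch to LIFO

class AdaptiveLifoQueue:
    def __init__(self):
        self._queue = deque()

    def enqueue(self, req):
        self._queue.append(req)

    def dequeue(self):
        if not self._queue:
            return None
        if len(self._queue) > LIFO_THRESHOLD:
            # Queue is building up: serve the newest request first, since the
            # oldest ones have likely already been abandoned by their users.
            return self._queue.pop()
        # Normal operation: first in, first out.
        return self._queue.popleft()
```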
### Concurrency Control
- Some failures are so severe, however, that server-side controls are not able to kick in.
- Each client keeps track of the number of outstanding outbound requests on a per-service basis. When new requests are sent, if the number of outstanding requests to that service exceeds a configurable number, the request is immediately marked as an error.
- Figure: client-side throttle.
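- A minimal sketch of the client-side throttle described above: count outstanding outbound requests per service and fail fast once a configurable limit is exceeded. The limit value and names are assumptions.
```python
class ConcurrencyLimiter:
    """Track outstanding outbound requests per service and reject new ones
    once a configurable limit is exceeded (client-side throttle sketch)."""

    def __init__(self, max_outstanding=500):    # illustrative default limit
        self._max = max_outstanding
        self._outstanding = {}                  # service name -> in-flight count

    def try_acquire(self, service):
        count = self._outstanding.get(service, 0)
        if count >= self._max:
            return False    # mark the request as an error immediately
        self._outstanding[service] = count + 1
        return True

    def release(self, service):
        self._outstanding[service] = max(0, self._outstanding.get(service, 0) - 1)

# Usage: wrap each outbound call.
# if limiter.try_acquire("feed-service"):
#     try: send_request()
#     finally: limiter.release("feed-service")
# else:
#     fail_fast()
```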
### Tools that Help Diagnose Failures
- High-Density Dashboards with Cubism
- Our dashboards grew so large that it was difficult to navigate them quickly.
- To address this, we built our top-level dashboards using Cubism [2].
2. Cubism; https://square.github.io/cubism/.
- What just changed?
- We collect information about recent changes ranging from configuration changes to deployments of new software in a tool called OpsStream.
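- As an illustration only, a change log like the one OpsStream collects could be filtered for changes made shortly before an incident began; the record format and time window below are assumptions.
```python
from datetime import datetime, timedelta

def changes_near(change_log, incident_start, window_minutes=30):
    """Return changes (config pushes, deployments, ...) recorded shortly
    before an incident started; the time window is an illustrative default."""
    window = timedelta(minutes=window_minutes)
    return [c for c in change_log
            if incident_start - window <= c["time"] <= incident_start]

# A change_log entry might look like:
# {"time": datetime(2021, 2, 12, 9, 30), "type": "config", "what": "raise feed cache TTL"}
```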
### Learning from Failure
- After failures happen, our incident-review process helps us learn from these incidents.
- Facebook has developed a methodology called DERP (for detection, escalation, remediation, and prevention) to aid in productive incident reviews.
- Detection. How was the issue detected?
- Escalation. Did the right people get involved quickly? Could these people have been brought in via alarms rather than manually?
- Remediation. What steps were taken to fix the issue? Can these steps be automated?
- Prevention. What improvements could remove the risk of this type of failure happening again?