# Cindy Sridharan - Distributed Systems Observability (Highlights)

![rw-book-cover|256](https://learning.oreilly.com/library/view/distributed-systems-observability/9781492033431/ibis_generated_cover_thumbnail.jpg)

## Metadata

**Cover**:: https://learning.oreilly.com/library/view/distributed-systems-observability/9781492033431/ibis_generated_cover_thumbnail.jpg
**Source**:: #from/readwise
**Zettel**:: #zettel/fleeting
**Status**:: #x
**Authors**:: [[Cindy Sridharan]]
**Full Title**:: Distributed Systems Observability
**Category**:: #books #readwise/books
**Category Icon**:: 📚
**Highlighted**:: [[2021-02-05]]
**Created**:: [[2022-09-26]]

## Highlights

### 1. The Need for Observability

- A system can be deployed incrementally, and in a manner such that a rollback (or roll forward) can be triggered if a key set of metrics deviates from the baseline.
- A system can be tested in a manner such that any of the hard, actionable failure modes (the sort that often result in alerts once the system has been deployed) can be surfaced during testing.
- A system can report enough data points about its health and behavior when serving real traffic that it can be understood, debugged, and evolved.
- A system can be built in a way that lends itself well to being tested in a realistic manner (which involves a certain degree of testing in production).

### 2. Monitoring and Observability

- Observability is a superset of monitoring.
- A good set of metrics for monitoring purposes is the USE metrics and the RED metrics.
- The USE method calls for measuring utilization, saturation, and errors, primarily of system resources: for example, available free memory (utilization), CPU run queue length (saturation), or device errors (errors).
- The USE methodology for analyzing system performance was coined by Brendan Gregg.
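The USE checklist above can be sketched as a small helper that classifies a resource's health from sampled values. This is a minimal illustration of the method, not Brendan Gregg's tooling; the field names and thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class ResourceSample:
    """One USE-style observation of a system resource (illustrative fields)."""
    name: str
    utilization: float   # fraction busy, 0.0-1.0 (e.g., memory in use)
    saturation: float    # queued work, e.g., CPU run queue length
    errors: int          # error-event count since the last sample

def use_verdict(s: ResourceSample,
                util_limit: float = 0.9,
                sat_limit: float = 1.0) -> str:
    """Apply the USE method: check errors first, then saturation, then utilization."""
    if s.errors > 0:
        return f"{s.name}: ERRORS ({s.errors}) - investigate first"
    if s.saturation > sat_limit:
        return f"{s.name}: SATURATED (queue={s.saturation})"
    if s.utilization > util_limit:
        return f"{s.name}: HIGH UTILIZATION ({s.utilization:.0%})"
    return f"{s.name}: ok"

print(use_verdict(ResourceSample("cpu", utilization=0.75, saturation=3.0, errors=0)))
```

Checking errors before saturation and saturation before utilization mirrors the usual USE triage order: hard failures first, queuing next, raw busyness last.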
- The RED method was proposed by Tom Wilkie.
- The RED method calls for monitoring the request rate, error rate, and duration of requests (generally represented via a histogram).
- The degree of a system's observability is the degree to which it can be debugged.
- Debugging is often an iterative process that involves the following:
    - Starting with a high-level metric
    - Being able to drill down by introspecting various fine-grained, contextual observations reported by various parts of the system
    - Being able to make the right deductions
    - Testing whether the theory holds water
- The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.

### 3. Coding and Testing for Observability

- Coding for Failure
    - Coding for failure entails acknowledging that systems will fail, and requires:
        - Understanding the operational semantics of the application
        - Understanding the operational characteristics of the dependencies
        - Writing code that's debuggable
- Testing for Failure
    - Program testing can be used very effectively to show the presence of bugs, but never to show their absence.

### 4. The Three Pillars of Observability

- Event logs are especially helpful for uncovering emergent and unpredictable behaviors exhibited by components of a distributed system.
- A typical workflow:
    - Start with a symptom pinpointed by a high-level metric or a log event in a specific system
    - Infer the request lifecycle across different components of the distributed architecture
    - Iteratively ask questions about interactions among various parts of the system
- Traces and metrics are abstractions built on top of logs that pre-process and encode information along two orthogonal axes: one request-centric (trace), the other system-centric (metric).
- Logs are, by far, the easiest to generate.
- Logs are also easy to instrument.
- The application as a whole becomes susceptible to suboptimal performance due to the overhead of logging.
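One common way to bound the logging overhead mentioned above is probabilistic sampling of low-severity records. Below is a minimal sketch using Python's standard `logging` module; the sampling rate and the choice of WARNING as the always-keep threshold are assumptions for the example, not the book's code.

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Pass through WARNING and above; sample lower-severity records at `rate`."""
    def __init__(self, rate: float):
        super().__init__()
        self.rate = rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True                      # never drop warnings or errors
        return random.random() < self.rate   # keep roughly `rate` of DEBUG/INFO

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(rate=0.01))  # keep ~1% of chatty log lines
logger.addHandler(handler)
```

A fixed rate like this is the crudest form; dynamic sampling, which the book mentions, would adjust `rate` at runtime based on volume.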
- Log messages can also be lost.
- RELP is a protocol that uses a command-response model.
- The RELP server is designed to throttle the number of outstanding commands to conserve resources.
- Unless the logging library can dynamically sample logs, logging excessively can adversely affect application performance as a whole.
- Shipping tools: Logstash, fluentd, Scribe, or Heka.
- Storage and query backends: Elasticsearch or BigQuery.
- Buffering in a broker like Kafka.
- Log processing neatly fits the bill of Online Analytical Processing (OLAP).
- Using minimal indexing.
- An alternative is Humio, a hosted and on-premises solution that treats log processing as a stream processing problem.
- Yet another alternative is Honeycomb.
- Metrics transfer and storage have a constant overhead.
- Metrics are also better suited to trigger alerts.
- Using high-cardinality values like UIDs as metric labels can overwhelm time-series databases.
- Zipkin and Jaeger are two of the most popular OpenTracing-compliant open source distributed tracing solutions.
- Tracing is, by far, the hardest to retrofit into an existing infrastructure.
- The rise of service meshes makes integrating tracing functionality almost effortless.
- Emit:
    - A counter and log at every major entry and exit point of a request
    - A log and trace at every decision point of a request
- This makes it possible:
    - To reconstruct the codepath taken by reading a trace
    - To derive request or error ratios from any single point in the codepath
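The final bullets, counting at every entry and exit point so that error ratios can be derived anywhere in the codepath, can be sketched with a decorator. This is a crude in-process illustration; a real system would use a metrics client such as prometheus_client, and the endpoint name and handler here are hypothetical.

```python
import time
from collections import Counter, defaultdict

REQUESTS = Counter()            # (endpoint, "total"/"error") -> count
DURATIONS = defaultdict(list)   # endpoint -> observed latencies in seconds

def instrumented(endpoint):
    """Wrap a handler: count every entry/exit and record duration (the RED signals)."""
    def decorator(handler):
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return handler(*args, **kwargs)
            except Exception:
                REQUESTS[(endpoint, "error")] += 1
                raise
            finally:
                REQUESTS[(endpoint, "total")] += 1
                DURATIONS[endpoint].append(time.monotonic() - start)
        return wrapper
    return decorator

def error_ratio(endpoint):
    """Derive the error ratio at this point in the codepath from the counters."""
    total = REQUESTS[(endpoint, "total")]
    return REQUESTS[(endpoint, "error")] / total if total else 0.0

@instrumented("/checkout")        # hypothetical endpoint for illustration
def checkout(ok=True):
    if not ok:
        raise RuntimeError("payment failed")
    return "done"

checkout()
try:
    checkout(ok=False)
except RuntimeError:
    pass
print(error_ratio("/checkout"))   # one error out of two requests -> 0.5
```

Because the counter is bumped in a `finally` block, every request is counted whether it succeeds or raises, which is what makes the derived ratio trustworthy.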