Cindy Sridharan - Monitoring in the Time of Cloud Native (Highlights)

# Cindy Sridharan - Monitoring in the Time of Cloud Native (Highlights)

![rw-book-cover|256](https://readwise-assets.s3.amazonaws.com/static/images/article0.00998d930354.png)

## Metadata
**Review**:: [readwise.io](https://readwise.io/bookreview/14695933)
**Source**:: #from/readwise
**Zettel**:: #zettel/fleeting
**Status**:: #x
**Authors**:: [[Cindy Sridharan]]
**Full Title**:: Monitoring in the Time of Cloud Native
**Category**:: #articles #readwise/articles
**Category Icon**:: 📰
**Document Tags**:: #devops
**URL**:: [copyconstruct.medium.com](https://copyconstruct.medium.com/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e)
**Host**:: [[copyconstruct.medium.com]]
**Highlighted**:: [[2021-02-01]]
**Created**:: [[2022-10-12]]

## Highlights
- Monitoring, to me, connotes something that is inherently both *failure* as well as *human* centric. ^309894824
- We strive to make a service reliable enough, but no **more** reliable than it needs to be. ^309894825
- Opting in to the model of *embracing* failure entails designing our services to behave gracefully in the face of failure. ^309894826
- Observability, in my opinion, is really about being able to understand how a system is behaving in production. ^309894827
- An observable system is one that exposes enough *data* about itself so that *generating information* (finding answers to questions yet to be formulated) and easily accessing this information becomes simple. ^309894828
- blackbox monitoring still has its place ^309894829
- “*Whitebox* monitoring” refers to a category of “monitoring” based on the information derived from the internals of systems. ^309894830
- using the data to *alert based on symptoms* ^309894831
- use this data to know the *overall health* of a service ^309894832
- We could use this data to debug ^309894833
- We could also use the data for purposes like *profiling* to derive better understanding about the behavior of our system in production ^309894834
- We might also want to understand how our service depends on other services *currently*, so as to enable us to understand if our service is being impacted by another service, ^309894835
- Hosted solutions like BigQuery ^309894836
- With or without quotas, it becomes important to be able to [dynamically sample logs](https://www.honeycomb.io/blog/dynamic-sampling-by-example/), ^309894837
- the remote write feature that was added to Prometheus about a year ago allowed one to write Prometheus metrics to a custom remote storage engine like OpenTSDB or Graphite, effectively turning Prometheus into a write-through cache. ^309894838
- With the recent introduction of the generic write backend, one can transport time-series from Prometheus over HTTP and Protobuf to any storage system like Kafka or Cassandra. ^309894839
- InfluxDB now natively supports both Prometheus remote reads and writes. ^309894840
- With metrics, however, it’s important to be careful not to explode the label space. ^309894841
- For alerting to be *effective,* it becomes salient to be able to identify a small set of hard failure modes of a system. ^309894842
- Some believe that the ideal number of signals to be “monitored” is anywhere between 3–5, and definitely no more than 7–10. ^309894843
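
A note on the highlight about not exploding the label space: the rough arithmetic can be sketched in a few lines of Python. This is only an illustration, not something from the article. The label names and value counts below are hypothetical, and the product is an upper bound that assumes every combination of label values actually occurs.

```python
from math import prod


def series_count(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each unique combination of label values is its own series, so the
    bound is the product of the distinct value counts across labels.
    """
    return prod(distinct_values_per_label.values())


# Hypothetical labels on a request counter, each with a small, fixed set of values.
bounded = {"method": 4, "status_class": 5, "endpoint": 20}

# Same metric with one unbounded label added (hypothetical user count).
unbounded = {**bounded, "user_id": 100_000}

print(series_count(bounded))    # 400
print(series_count(unbounded))  # 40000000
```

Under these assumed numbers, a single high-cardinality label turns a few hundred series into tens of millions, which is the kind of blow-up the highlight warns about.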