# Cindy Sridharan - Monitoring in the Time of Cloud Native (Highlights)

## Metadata
**Review**:: [readwise.io](https://readwise.io/bookreview/14695933)
**Source**:: #from/readwise
**Zettel**:: #zettel/fleeting
**Status**:: #x
**Authors**:: [[Cindy Sridharan]]
**Full Title**:: Monitoring in the Time of Cloud Native
**Category**:: #articles #readwise/articles
**Category Icon**:: 📰
**Document Tags**:: #devops
**URL**:: [copyconstruct.medium.com](https://copyconstruct.medium.com/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e)
**Host**:: [[copyconstruct.medium.com]]
**Highlighted**:: [[2021-02-01]]
**Created**:: [[2022-10-12]]
## Highlights
- Monitoring, to me, connotes something that is inherently both *failure* as well as *human* centric. ^309894824
- We strive to make a service reliable enough, but no **more** reliable than it needs to be. ^309894825
- Opting in to the model of *embracing* failure entails designing our services to behave gracefully in the face of failure. ^309894826
- Observability, in my opinion, is really about being able to understand how a system is behaving in production. ^309894827
- An observable system is one that exposes enough *data* about itself so that *generating information* (finding answers to questions yet to be formulated) and easily accessing this information becomes simple. ^309894828
- blackbox monitoring still has its place ^309894829
- “*Whitebox* monitoring” refers to a category of “monitoring” based on the information derived from the internals of systems. ^309894830
- using the data to *alert based on symptoms* ^309894831
- use this data to know the *overall health* of a service ^309894832
- We could use this data to debug ^309894833
- We could also use the data for purposes like *profiling* to derive better understanding about the behavior of our system in production ^309894834
- We might also want to understand how our service depends on other services *currently*, so as to enable us to understand if our service is being impacted by another service, ^309894835
- Hosted solutions like BigQuery ^309894836
- With or without quotas, it becomes important to be able to [dynamically sample logs](https://www.honeycomb.io/blog/dynamic-sampling-by-example/), ^309894837
- the remote write feature that was added to Prometheus about a year ago allowed one to write Prometheus metrics to a custom remote storage engine like OpenTSDB or Graphite, effectively turning Prometheus into a write-through cache. ^309894838
- With the recent introduction of the generic write backend, one can transport time-series from Prometheus over HTTP and Protobuf to any storage system like Kafka or Cassandra. ^309894839
- InfluxDB now natively supports both Prometheus remote reads and writes. ^309894840
- With metrics, however, it’s important to be careful not to explode the label space. ^309894841
- For alerting to be *effective*, it becomes salient to be able to identify a small set of hard failure modes of a system. ^309894842
- Some believe that the ideal number of signals to be “monitored” is anywhere between 3–5, and definitely no more than 7–10. ^309894843
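
The highlight on dynamically sampling logs can be made concrete with a small sketch. This is one possible scheme, not the method from the linked Honeycomb post: frequent, low-value events are kept at a rate of 1 in N, rare and important ones are always kept, and each kept event records its sample rate so totals can be estimated later. The `status` field, the rates in `SAMPLE_RATES`, and both function names are assumptions made up for illustration.

```python
import random
from typing import Callable, Optional

# Hypothetical per-status sample rates: keep roughly 1 in N events.
SAMPLE_RATES = {"ok": 100, "client_error": 10, "server_error": 1}
DEFAULT_RATE = 1  # events with an unknown status are always kept


def sample_event(
    event: dict, rand: Callable[[], float] = random.random
) -> Optional[dict]:
    """Return the event annotated with its sample rate, or None if dropped."""
    rate = SAMPLE_RATES.get(event.get("status"), DEFAULT_RATE)
    # rand() is in [0, 1), so a rate of 1 always keeps the event.
    if rand() < 1.0 / rate:
        return {**event, "sample_rate": rate}
    return None


def estimated_total(kept_events: list) -> int:
    """Each kept event stands in for `sample_rate` original events."""
    return sum(e["sample_rate"] for e in kept_events)
```

Recording the rate on each kept event is the design choice that matters here: it lets downstream queries weight events back up to approximate the original volume, at the cost of some statistical noise on the heavily sampled keys.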
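
The warning about exploding the label space comes down to simple arithmetic: in a label-based metrics system such as Prometheus, every distinct combination of label values is its own time series, so the worst-case series count for one metric is the product of the number of values each label can take. The sketch below computes that upper bound. The label names and value counts are invented for illustration, and `max_series` is a hypothetical helper, not part of any library.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical labels on a request-count metric.
bounded = {"method": 4, "status_code": 5, "endpoint": 20}
# Adding an unbounded label such as a user ID multiplies everything.
unbounded = {**bounded, "user_id": 10_000}

print(max_series(bounded))    # 400
print(max_series(unbounded))  # 4000000
```

Real systems rarely hit every combination, so this is a ceiling rather than a prediction, but it shows why a single high-cardinality label is usually better placed in logs or traces than in a metric label.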