# Cindy Sridharan - Testing in Production, the Safe Way (Highlights)

![rw-book-cover|256](https://readwise-assets.s3.amazonaws.com/static/images/article2.74d541386bbf.png)

## Metadata

**Cover**:: https://readwise-assets.s3.amazonaws.com/static/images/article2.74d541386bbf.png
**Source**:: #from/readwise
**Zettel**:: #zettel/fleeting
**Status**:: #x
**Authors**:: [[Cindy Sridharan]]
**Full Title**:: Testing in Production, the Safe Way
**Category**:: #articles #readwise/articles
**Category Icon**:: 📰
**URL**:: [copyconstruct.medium.com](https://copyconstruct.medium.com/testing-in-production-the-safe-way-18ca102d0ef1)
**Host**:: [[copyconstruct.medium.com]]
**Highlighted**:: [[2021-02-06]]
**Created**:: [[2022-09-26]]

## Highlights

- explore different forms of “testing in production”
- The Three Phases of “Production”: “deploy”, “release”, “ship”
- **deploying a service in the production environment doesn’t expose users to that service immediately**.
- Your software has started successfully, passed health checks, and is ready (you hope!) to handle production traffic, but may not actually be receiving any.
- When we say a version of a service is **released**, we mean that it is responsible for serving production traffic. In verb form, **releasing** is the process of moving production traffic to the new version.
- integration testing in production
- only test small groups of units together
- An integration test takes a small group of units, often two units, and tests their behavior as a whole, verifying that they coherently work together.
- sampling real traffic is the only way to reliably capture the request path.
- Shadowing is the technique by which production traffic to any given service is captured and replayed against the newly *deployed* version of the service.
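The shadowing technique highlighted above can be sketched in a few lines: serve every request from the current version while replaying a copy against the newly *deployed* candidate in the background and discarding its response. This is a minimal illustration, not the article's implementation; the handler names (`old_service`, `new_service`) are hypothetical stand-ins for real services.

```python
import threading

def shadow(primary, candidate):
    """Return a handler that serves responses from `primary` while
    replaying each request against `candidate` in the background.
    The candidate's response is discarded, and its errors are
    swallowed, so the shadow deployment can never affect users."""
    def handler(request):
        def replay():
            try:
                candidate(request)       # fire-and-forget replay
            except Exception:
                pass                     # never let the shadow break prod
        threading.Thread(target=replay, daemon=True).start()
        return primary(request)          # users only ever see this response
    return handler

# Hypothetical handlers standing in for the released and deployed versions.
seen_by_candidate = []
old_service = lambda req: f"v1:{req}"
new_service = lambda req: seen_by_candidate.append(req) or f"v2:{req}"

serve = shadow(old_service, new_service)
print(serve("GET /users/42"))  # v1 answers; v2 sees a copy of the traffic
```

Because the replay is asynchronous and its result is thrown away, a crashing or slow candidate only shows up in the candidate's own metrics, never in user-facing latency.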
- Shadowing
- Integration Testing
- Tap Compare
- Our tap-compare tool replays a sample of production traffic against the new system and compares the responses to the old system.
- Another tool in this space is GitHub’s [scientist](https://github.com/github/scientist). Scientist was built to test Ruby code but has since been ported over to a [number of different languages](https://github.com/github/scientist#alternatives).
- It just runs two code paths and compares the results.
- [Diffy](https://github.com/twitter/diffy) is a Scala-based tool open sourced by Twitter in 2015.
- Load Testing
- There’s no dearth of open source load testing tools and frameworks, the most popular ones being [Apache Bench](https://httpd.apache.org/docs/2.4/programs/ab.html), [Gatling](https://gatling.io), [wrk2](https://github.com/giltene/wrk2), the Erlang-based [Tsung](http://tsung.erlang-projects.org), [Siege](https://www.joedog.org/siege-home/), and Twitter’s Scala-based [Iago](https://github.com/twitter/iago) (which scrubs the logs of an HTTP server, a proxy, or a network sniffer and replays them against a test instance).
- Netflix’s [NDBench](https://github.com/Netflix/ndbench) is another open source tool for load testing data stores and comes with support for most of the usual suspects.
- [mzbench](https://github.com/satori-com/mzbench) really is the best in class for generating load
- Config Tests
- Config pushes need to be treated with at least the same care as program binaries, and arguably more, because unit testing config is difficult without a compiler.
- Canarying
- *Canarying* refers to a partial *release* of a service to production.
- Monitoring
- Exception Tracking
- Traffic Shaping
- In essence, traffic shifting is achieved by updating the configuration of the load balancer to gradually route more traffic to the newly *released* version.
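The Scientist-style idea highlighted above, running two code paths and comparing the results, can be sketched as follows. This is a hedged illustration of the pattern, not GitHub's actual API; `experiment`, `publish`, and the permission-check lambdas are hypothetical names.

```python
def experiment(control, candidate, publish):
    """Scientist-style comparison: run both code paths, publish whether
    they agreed, and always return the control's result to the caller,
    so the candidate can never change production behavior."""
    def run(*args, **kwargs):
        result = control(*args, **kwargs)
        try:
            trial = candidate(*args, **kwargs)
            if trial != result:
                publish({"matched": False, "control": result, "candidate": trial})
        except Exception as exc:
            publish({"matched": False, "error": repr(exc)})
        return result                    # callers only ever see the control
    return run

# Hypothetical old and new permission checks being compared in production.
mismatches = []
check = experiment(
    control=lambda user: user == "admin",
    candidate=lambda user: user in {"admin", "root"},  # new logic under test
    publish=mismatches.append,
)
print(check("admin"))   # True: both paths agree
print(check("root"))    # False: the control wins; the mismatch is recorded
```

The key design choice, which Scientist also makes, is that the candidate's result is only observed, never returned, so a buggy rewrite surfaces as a recorded mismatch rather than a user-visible error.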
- Feature Flagging or Dark Launch
- A/B Testing
- Logs/Events, Metrics and Tracing
- Profiling
- Teeing
- Chaos Engineering
- Chaos testing refers to techniques such as:
	- killing random nodes to see if the system is resilient to node failure
	- injecting error conditions (such as latency) to verify their proper handling
	- forcing a misbehaving network to see how the service reacts
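One of the chaos techniques above, injecting latency as an error condition, can be sketched as a wrapper around a downstream call. This is a minimal illustration under assumed names (`inject_latency`, `slow_lookup`), not a chaos-engineering framework; real tools inject faults at the network or infrastructure layer rather than in application code.

```python
import random
import time

def inject_latency(call, probability=0.1, delay_s=0.5, rng=random.random):
    """Chaos-style fault injection: wrap a service call so that some
    fraction of invocations is artificially delayed, letting you verify
    that callers' timeouts and fallbacks actually fire."""
    def chaotic(*args, **kwargs):
        if rng() < probability:
            time.sleep(delay_s)          # simulated network latency
        return call(*args, **kwargs)
    return chaotic

# Hypothetical downstream call; probability forced to 1.0 for the demo.
slow_lookup = inject_latency(lambda key: f"value:{key}",
                             probability=1.0, delay_s=0.05)
start = time.monotonic()
result = slow_lookup("user:42")
elapsed = time.monotonic() - start
print(result)             # the call still succeeds, just later
print(elapsed >= 0.05)    # True: the injected delay was observed
```

Keeping the injection probabilistic and configurable is what makes this safe to run against production traffic: most requests are untouched, and the blast radius can be dialed up only after the system proves it handles the slow ones.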