# Martin Kleppmann - Designing Data-Intensive Applications (Highlights)

## Metadata
**Review**:: [readwise.io](https://readwise.io/bookreview/55987815)
**Source**:: #from/readwise #from/zotero
**Zettel**:: #zettel/fleeting
**Status**:: #x
**Authors**:: [[Martin Kleppmann]]
**Full Title**:: Designing Data-Intensive Applications
**Category**:: #books #readwise/books
**Category Icon**:: 📚
**Highlighted**:: [[2025-11-01]]
**Created**:: [[2025-11-07]]
## Highlights
- We call an application data-intensive if data is its primary challenge—the quantity of data, the complexity of data, or the speed at which it is changing—as opposed to compute-intensive, where CPU cycles are the bottleneck. ([Page 9](zotero://open-pdf/library/items/NUWYWJMI?page=8&annotation=MG3H8C46)) ^953405229
- This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics. ([Page 22](zotero://open-pdf/library/items/NUWYWJMI?page=21&annotation=9AB922A5)) ^953405230
#motivation
- Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. ([Page 27](zotero://open-pdf/library/items/NUWYWJMI?page=26&annotation=M58R4K72)) ^953405231
- Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. The Netflix Chaos Monkey [4] is an example of this approach. ([Page 27](zotero://open-pdf/library/items/NUWYWJMI?page=26&annotation=BG29PNNM)) ^953405232
- As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic in most applications. ([Page 28](zotero://open-pdf/library/items/NUWYWJMI?page=27&annotation=83LDLME7)) ^953405233
- However, as data volumes and applications’ computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. ([Page 28](zotero://open-pdf/library/items/NUWYWJMI?page=27&annotation=F5BPDR74)) ^953405234
- However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right. ([Page 30](zotero://open-pdf/library/items/NUWYWJMI?page=29&annotation=FTY5MVR5)) ^953405235
- Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests. ([Page 41](zotero://open-pdf/library/items/NUWYWJMI?page=40&annotation=BE9J3L8Q)) ^953405236
- each layer hides the complexity of the layers below it by providing a clean data model ([Page 59](zotero://open-pdf/library/items/NUWYWJMI?page=58&annotation=I4QDBWX4)) ^953405237
- Removing such duplication is the key idea behind normalization in databases. ([Page 69](zotero://open-pdf/library/items/NUWYWJMI?page=68&annotation=ZLILUIF3)) ^953405238
- The fact that SQL is more limited in functionality gives the database much more room for automatic optimizations. ([Page 81](zotero://open-pdf/library/items/NUWYWJMI?page=80&annotation=MSRFGI45)) ^953405239
- Declarative languages have a better chance of getting faster in parallel execution because they specify only the pattern of the results, not the algorithm that is used to determine the results. ([Page 82](zotero://open-pdf/library/items/NUWYWJMI?page=81&annotation=JWIRKXHZ)) ^953405240
- Higher-level query languages like SQL can be implemented as a pipeline of MapReduce operations (see Chapter 10) ([Page 87](zotero://open-pdf/library/items/NUWYWJMI?page=86&annotation=IK88CXX5)) ^953405241
- A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs). ([Page 88](zotero://open-pdf/library/items/NUWYWJMI?page=87&annotation=B47K5RWU)) ^953405242
- In this section we will discuss the property graph model (implemented by Neo4j, Titan, and InfiniteGraph) and the triple-store model (implemented by Datomic, AllegroGraph, and others). ([Page 91](zotero://open-pdf/library/items/NUWYWJMI?page=90&annotation=QYQLGSGR)) ^953405243
- In a graph query, you may need to traverse a variable number of edges before you find the vertex you’re looking for—that is, the number of joins is not fixed in advance. ([Page 95](zotero://open-pdf/library/items/NUWYWJMI?page=94&annotation=U62BYX9G)) ^953405244
- In Cypher, :WITHIN*0.. expresses that fact very concisely: it means “follow a WITHIN edge, zero or more times.” ([Page 96](zotero://open-pdf/library/items/NUWYWJMI?page=95&annotation=BHJV2XJR)) ^953405245
- Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using something called recursive common table expressions (the WITH RECURSIVE syntax). ([Page 96](zotero://open-pdf/library/items/NUWYWJMI?page=95&annotation=DI932BL5)) ^953405246
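A minimal sketch of such a variable-length traversal, using SQLite's support for recursive common table expressions. The table name, the edge data, and the starting vertex are my own illustrative choices, not the book's example code.

```python
import sqlite3

# Toy "within" edges: a location graph, child -> parent.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE within_edge (child TEXT, parent TEXT);
INSERT INTO within_edge VALUES
  ('Idaho', 'United States'),
  ('United States', 'North America');
""")

# Follow "within" edges zero or more times starting from 'Idaho' --
# the SQL:1999 counterpart of Cypher's :WITHIN*0.. pattern.
rows = conn.execute("""
WITH RECURSIVE reachable(name) AS (
  VALUES ('Idaho')                       -- zero edges: the start vertex itself
  UNION
  SELECT e.parent
  FROM within_edge e
  JOIN reachable r ON e.child = r.name   -- one more edge per recursion step
)
SELECT name FROM reachable;
""").fetchall()
```

The recursion terminates because `UNION` discards duplicate rows, so the working set stops growing once every reachable vertex has been seen.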
- In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object. ([Page 98](zotero://open-pdf/library/items/NUWYWJMI?page=97&annotation=G8I9CI7Z)) ^953405247

- SPARQL is a query language for triple-stores using the RDF data model [43]. (It is an acronym for SPARQL Protocol and RDF Query Language, pronounced “sparkle.”) It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite similar [37]. ([Page 101](zotero://open-pdf/library/items/NUWYWJMI?page=100&annotation=94Z5PGDL)) ^953405248
- Datalog is a much older language than SPARQL or Cypher, having been studied extensively by academics in the 1980s [44, 45, 46]. It is less well known among software engineers, but it is nevertheless important, because it provides the foundation that later query languages build upon. ([Page 104](zotero://open-pdf/library/items/NUWYWJMI?page=103&annotation=8ABCCHAU)) ^953405249
- Datalog’s data model is similar to the triple-store model, generalized a bit. Instead of writing a triple as (subject, predicate, object), we write it as predicate(subject, object). ([Page 104](zotero://open-pdf/library/items/NUWYWJMI?page=103&annotation=M8ZNREGH)) ^953405250
- Cypher and SPARQL jump in right away with SELECT, but Datalog takes a small step at a time. We define rules that tell the database about new predicates: here, we define two new predicates, within_recursive and migrated. These predicates aren’t triples stored in the database, but instead they are derived from data or from other rules. Rules can refer to other rules, just like functions can call other functions or recursively call themselves. ([Page 105](zotero://open-pdf/library/items/NUWYWJMI?page=104&annotation=ZXJB9SMV)) ^953405251
- In rules, words that start with an uppercase letter are variables, and predicates are matched like in Cypher and SPARQL. ([Page 105](zotero://open-pdf/library/items/NUWYWJMI?page=104&annotation=PNK8WWKP)) ^953405252
Like pattern matching in Erlang
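A toy fixed-point evaluation of a Datalog-style recursive rule, in the `predicate(subject, object)` form described above. The predicate names and facts are mine, chosen to mirror the book's `within_recursive` example; a real Datalog engine evaluates rules far more efficiently.

```python
# Facts stored as predicate(subject, object): a dict keyed by predicate name.
facts = {
    "within": {("idaho", "usa"), ("usa", "north_america")},
}

# Rule (Datalog notation):
#   within_recursive(A, C) :- within(A, C).
#   within_recursive(A, C) :- within(A, B), within_recursive(B, C).
# Naive evaluation: keep applying the rule until no new facts are derived.
derived = set(facts["within"])
changed = True
while changed:
    changed = False
    for (a, b) in facts["within"]:
        for (b2, c) in list(derived):
            if b == b2 and (a, c) not in derived:
                derived.add((a, c))      # a new derived fact
                changed = True
facts["within_recursive"] = derived
```

Like the rules in the book's example, `within_recursive` is not stored data; it is derived from `within`, and rules can refer to themselves just as recursive functions do.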
- Now we can make a simple change to the format of our segment files: we require that the sequence of key-value pairs is sorted by key. We call this format Sorted String Table, or SSTable for short. ([Page 129](zotero://open-pdf/library/items/NUWYWJMI?page=128&annotation=W999FARV)) ^953405253
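A compact sketch of why the sorted format helps compaction: merging segments is just a merge of sorted runs where the newest segment wins for duplicate keys. The segment contents below are illustrative, not taken from the book.

```python
# Each segment is a list of (key, value) pairs sorted by key.
def merge_segments(segments):
    """Compact several SSTable segments; `segments` is ordered oldest first."""
    merged = {}
    for segment in segments:          # later (newer) segments overwrite older values
        for key, value in segment:
            merged[key] = value
    return sorted(merged.items())     # the result is itself a valid SSTable

old = [("apple", 1), ("cherry", 2)]
new = [("apple", 9), ("banana", 3)]
compacted = merge_segments([old, new])
```

Because both inputs and output stay sorted by key, compaction can in principle stream through the segments sequentially instead of building the dict in memory.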
- Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk (indicated by the shaded area in Figure 3-5). ([Page 133](zotero://open-pdf/library/items/NUWYWJMI?page=132&annotation=4L5CLMUD)) ^953405254
- It only suffers from one problem: if the database crashes, the most recent writes (which are in the memtable but not yet written out to disk) are lost. In order to avoid that problem, we can keep a separate log on disk to which every write is immediately appended, just like in the previous section. ([Page 134](zotero://open-pdf/library/items/NUWYWJMI?page=133&annotation=8UHZTXMH)) ^953405255
- The algorithm described here is essentially what is used in LevelDB [6] and RocksDB [7], key-value storage engine libraries that are designed to be embedded into other applications. ([Page 134](zotero://open-pdf/library/items/NUWYWJMI?page=133&annotation=4I8ALNCR)) ^953405256
- Originally this indexing structure was described by Patrick O’Neil et al. under the name Log-Structured Merge-Tree (or LSM-Tree) [10], building on earlier work on log-structured filesystems [11]. Storage engines that are based on this principle of merging and compacting sorted files are often called LSM storage engines. ([Page 134](zotero://open-pdf/library/items/NUWYWJMI?page=133&annotation=NLFRSMDG)) ^953405257
- For example, the LSM-tree algorithm can be slow when looking up keys that do not exist in the database: you have to check the memtable, then the segments all the way back to the oldest (possibly having to read from disk for each one) before you can be sure that the key does not exist. In order to optimize this kind of access, storage engines often use additional Bloom filters [15]. ([Page 135](zotero://open-pdf/library/items/NUWYWJMI?page=134&annotation=3S4RXJGY)) ^953405258
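A tiny Bloom filter sketch to make the optimization concrete (illustrative only; the sizes and hash scheme are my choices, not LevelDB's implementation). A "no" answer is definitive, which lets the engine skip segment lookups for absent keys; a "yes" may be a false positive.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)          # one byte per "bit", for simplicity

    def _positions(self, key):
        # Derive `hashes` independent positions by salting one strong hash.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False => definitely absent; True => probably present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
```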
- In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space. ([Page 135](zotero://open-pdf/library/items/NUWYWJMI?page=134&annotation=WBB82QUX)) ^953405259
#further-action
- Introduced in 1970 [17] and called “ubiquitous” less than 10 years later [18], B-trees have stood the test of time very well. They remain the standard index implementation in almost all relational databases, and many nonrelational databases use them too. ([Page 136](zotero://open-pdf/library/items/NUWYWJMI?page=135&annotation=4BE3RZU5)) ^953405260
- In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log). ([Page 140](zotero://open-pdf/library/items/NUWYWJMI?page=139&annotation=KAZ6BRWE)) ^953405261
LSM also needs a log.
- We can save space in pages by not storing the entire key, but abbreviating it. ([Page 141](zotero://open-pdf/library/items/NUWYWJMI?page=140&annotation=JCI5JX5S)) ^953405262
- B-tree variants such as fractal trees [22] borrow some log-structured ideas to reduce disk seeks (and they have nothing to do with fractals). ([Page 142](zotero://open-pdf/library/items/NUWYWJMI?page=141&annotation=PI9P6B3F)) ^953405263
- As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads [23]. ([Page 142](zotero://open-pdf/library/items/NUWYWJMI?page=141&annotation=9BNETMS7)) ^953405264
- Some storage engines even overwrite the same page twice in order to avoid ending up with a partially updated page in the event of a power failure [24, 25]. ([Page 142](zotero://open-pdf/library/items/NUWYWJMI?page=141&annotation=3LYXD4L7)) ^953405265
- This effect—one write to the database resulting in multiple writes to the disk over the course of the database’s lifetime—is known as write amplification. ([Page 143](zotero://open-pdf/library/items/NUWYWJMI?page=142&annotation=WFFA8T67)) ^953405266
- LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees. ([Page 143](zotero://open-pdf/library/items/NUWYWJMI?page=142&annotation=PFCS8XDK)) ^953405267
- On many SSDs, the firmware internally uses a log-structured algorithm to turn random writes into sequential writes on the underlying storage chips, so the impact of the storage engine’s write pattern is less pronounced [19]. ([Page 143](zotero://open-pdf/library/items/NUWYWJMI?page=142&annotation=VNPWAXSM)) ^953405268
- A downside of log-structured storage is that the compaction process can sometimes interfere with the performance of ongoing reads and writes. ([Page 144](zotero://open-pdf/library/items/NUWYWJMI?page=143&annotation=LM8X3LU7)) ^953405269
Similar to GC
- An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. This aspect makes B-trees attractive in databases that want to offer strong transactional semantics: in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree [5]. ([Page 144](zotero://open-pdf/library/items/NUWYWJMI?page=143&annotation=IHTCCAJZ)) ^953405270
- The situation is more complicated if the new value is larger, as it probably needs to be moved to a new location in the heap where there is enough space. In that case, either all indexes need to be updated to point at the new heap location of the record, or a forwarding pointer is left behind in the old heap location [5]. ([Page 146](zotero://open-pdf/library/items/NUWYWJMI?page=145&annotation=6KN6WN6H)) ^953405271
- In some situations, the extra hop from the index to the heap file is too much of a performance penalty for reads, so it can be desirable to store the indexed row directly within an index. This is known as a clustered index. ([Page 146](zotero://open-pdf/library/items/NUWYWJMI?page=145&annotation=XCK5G5VT)) ^953405272
- PostGIS implements geospatial indexes as R-trees using PostgreSQL’s Generalized Search Tree indexing facility [35]. ([Page 148](zotero://open-pdf/library/items/NUWYWJMI?page=147&annotation=67JVJSPQ)) ^953405273
- Products such as VoltDB, MemSQL, and Oracle TimesTen are in-memory databases with a relational model, and the vendors claim that they can offer big performance improvements by removing all the overheads associated with managing on-disk data structures [41, 42]. ([Page 150](zotero://open-pdf/library/items/NUWYWJMI?page=149&annotation=5VJ2LIZV)) ^953405274
- RAMCloud is an open source, in-memory key-value store with durability (using a log-structured approach for the data in memory as well as the data on disk) [43]. ([Page 150](zotero://open-pdf/library/items/NUWYWJMI?page=149&annotation=VLKDN66Q)) ^953405275
- Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk [44]. ([Page 150](zotero://open-pdf/library/items/NUWYWJMI?page=149&annotation=69PE9SQN)) ^953405276
Zero copy
- The so-called anti-caching approach works by evicting the least recently used data from memory to disk when there is not enough memory, and loading it back into memory when it is accessed again in the future. ([Page 150](zotero://open-pdf/library/items/NUWYWJMI?page=149&annotation=DJP3TAF4)) ^953405277
- Transaction processing just means allowing clients to make low-latency reads and writes—as opposed to batch processing jobs, which only run periodically (for example, once per day). ([Page 151](zotero://open-pdf/library/items/NUWYWJMI?page=150&annotation=4LYNICSS)) ^953405278
- Because these applications are interactive, the access pattern became known as online transaction processing (OLTP). ([Page 151](zotero://open-pdf/library/items/NUWYWJMI?page=150&annotation=AXYHFZUX)) ^953405279
#definition
- In order to differentiate this pattern of using databases from transaction processing, it has been called online analytic processing (OLAP) [47]. ([Page 152](zotero://open-pdf/library/items/NUWYWJMI?page=151&annotation=455QAMQU)) ^953405280
- A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations [48]. ([Page 156](zotero://open-pdf/library/items/NUWYWJMI?page=155&annotation=9HPBT5C3)) ^953405281
Motivation to use a data warehouse
- This process of getting data into the warehouse is known as Extract–Transform–Load (ETL) and is illustrated in Figure 3-8. ([Page 156](zotero://open-pdf/library/items/NUWYWJMI?page=155&annotation=ZXNIDMPH)) ^953405282
- More recently, a plethora of open source SQL-on-Hadoop projects have emerged; they are young but aiming to compete with commercial data warehouse systems. These include Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, and Apache Drill [52, 53]. Some of them are based on ideas from Google’s Dremel [54]. ([Page 159](zotero://open-pdf/library/items/NUWYWJMI?page=158&annotation=W3TLIGIU)) ^953405283
- Many data warehouses are used in a fairly formulaic style, known as a star schema (also known as dimensional modeling [55]). ([Page 159](zotero://open-pdf/library/items/NUWYWJMI?page=158&annotation=P9M6DTSL)) ^953405284
- At the center of the schema is a so-called fact table (in this example, it is called fact_sales). Each row of the fact table represents an event that occurred at a particular time (here, each row represents a customer’s purchase of a product). ([Page 159](zotero://open-pdf/library/items/NUWYWJMI?page=158&annotation=XXGT565Q)) ^953405285
- Other columns in the fact table are foreign key references to other tables, called dimension tables. As each row in the fact table represents an event, the dimensions represent the who, what, where, when, how, and why of the event. ([Page 161](zotero://open-pdf/library/items/NUWYWJMI?page=160&annotation=ICPTTIX8)) ^953405286
- Even date and time are often represented using dimension tables, because this allows additional information about dates (such as public holidays) to be encoded, allowing queries to differentiate between sales on holidays and non-holidays. ([Page 161](zotero://open-pdf/library/items/NUWYWJMI?page=160&annotation=E2E4K382)) ^953405287
- A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions. ([Page 161](zotero://open-pdf/library/items/NUWYWJMI?page=160&annotation=LTYG65IQ)) ^953405288
- The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead. ([Page 163](zotero://open-pdf/library/items/NUWYWJMI?page=162&annotation=VGR6ZL27)) ^953405289
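A minimal illustration of the two layouts, using made-up fact-table rows in the spirit of the star-schema example (the column names are mine). Storing each column contiguously means an analytic query touches only the columns it needs.

```python
# Row-oriented: each record's values stored together.
rows = [
    {"date_key": 140102, "product_sk": 69, "quantity": 1},
    {"date_key": 140102, "product_sk": 74, "quantity": 3},
    {"date_key": 140103, "product_sk": 69, "quantity": 2},
]

# Column-oriented: one sequence per column, values kept in the same row order,
# so row i can be reassembled from the i-th entry of every column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# "Total quantity sold" reads a single column and ignores the rest.
total_quantity = sum(columns["quantity"])
```

The repeated values within a column (e.g. `product_sk` 69 appearing twice) are also what makes the per-column compression mentioned later so effective.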
- Parquet [57] is a columnar storage format that supports a document data model, based on Google’s Dremel [54]. ([Page 163](zotero://open-pdf/library/items/NUWYWJMI?page=162&annotation=WUIMYWN4)) ^953405290
- Cassandra and HBase have a concept of column families, which they inherited from Bigtable [9]. However, it is very misleading to call them column-oriented: within each column family, they store all columns from a row together, along with a row key, and they do not use column compression. Thus, the Bigtable model is still mostly row-oriented. ([Page 168](zotero://open-pdf/library/items/NUWYWJMI?page=167&annotation=NJWXF879)) ^953405291
- Besides reducing the volume of data that needs to be loaded from disk, column-oriented storage layouts are also good for making efficient use of CPU cycles. ([Page 168](zotero://open-pdf/library/items/NUWYWJMI?page=167&annotation=GHLISDWG)) ^953405292
- The difference is that a materialized view is an actual copy of the query results, written to disk, whereas a virtual view is just a shortcut for writing queries. ([Page 171](zotero://open-pdf/library/items/NUWYWJMI?page=170&annotation=MID3ARRM)) ^953405293
- Robert Escriva, Bernard Wong, and Emin Gün Sirer: “HyperDex: A Distributed, Searchable Key-Value Store,” at ACM SIGCOMM Conference, August 2012. doi:10.1145/2377677.2377681 ([Page 178](zotero://open-pdf/library/items/NUWYWJMI?page=177&annotation=3UKWJ5HY)) ^953405294
#further-reading
- Justin DeBrabant, Andrew Pavlo, Stephen Tu, et al.: “Anti-Caching: A New Approach to Database Management System Architecture,” Proceedings of the VLDB Endowment, volume 6, number 14, pages 1942–1953, September 2013. ([Page 179](zotero://open-pdf/library/items/NUWYWJMI?page=178&annotation=K3KX2KWI)) ^953405295
#further-reading
- Sergey Melnik, Andrey Gubarev, Jing Jing Long, et al.: “Dremel: Interactive Analysis of Web-Scale Datasets,” at 36th International Conference on Very Large Data Bases (VLDB), pages 330–339, September 2010. ([Page 180](zotero://open-pdf/library/items/NUWYWJMI?page=179&annotation=4JM3RF2F)) ^953405296
#further-reading
- Backward compatibility Newer code can read data that was written by older code. ([Page 185](zotero://open-pdf/library/items/NUWYWJMI?page=184&annotation=79DEDZEB)) ^953405297
- Forward compatibility Older code can read data that was written by newer code. ([Page 185](zotero://open-pdf/library/items/NUWYWJMI?page=184&annotation=452J7BDX)) ^953405298
- The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the textual JSON encoding (with whitespace removed). ([Page 190](zotero://open-pdf/library/items/NUWYWJMI?page=189&annotation=IBKRH75Q)) ^953405299
- Avro is friendlier to dynamically generated schemas. For example, say you have a relational database whose contents you want to dump to a file, and you want to use a binary format to avoid the aforementioned problems with textual formats (JSON, CSV, XML). If you use Avro, you can fairly easily generate an Avro schema (in the JSON representation we saw earlier) from the relational schema and encode the database contents using that schema, dumping it all to an Avro object container file [25]. ([Page 208](zotero://open-pdf/library/items/NUWYWJMI?page=207&annotation=8ZRGN7ED)) ^953405300
- Thrift and Avro come with RPC support included, gRPC is an RPC implementation using Protocol Buffers, Finagle also uses Thrift, and Rest.li uses JSON over HTTP. ([Page 221](zotero://open-pdf/library/items/NUWYWJMI?page=220&annotation=GKJ6BXAJ)) ^953405301
- Martin Kleppmann: “Schema Evolution in Avro, Protocol Buffers and Thrift,” martin.kleppmann.com, December 5, 2012. ([Page 229](zotero://open-pdf/library/items/NUWYWJMI?page=228&annotation=8H2XJX5Y)) ^953405302
#further-reading
- Doug Cutting, Chad Walters, Jim Kellerman, et al.: “[PROPOSAL] New Subproject: Avro,” email thread on hadoop-general mailing list, mail-archives.apache.org, April 2009. ([Page 229](zotero://open-pdf/library/items/NUWYWJMI?page=228&annotation=DM8JA2YE)) ^953405303
#further-reading
- In practice, if you enable synchronous replication on a database, it usually means that one of the followers is synchronous, and the others are asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous. ([Page 247](zotero://open-pdf/library/items/NUWYWJMI?page=246&annotation=G949D53Q)) ^953405304
- In this situation, we need read-after-write consistency, also known as read-your-writes consistency [24]. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves. ([Page 260](zotero://open-pdf/library/items/NUWYWJMI?page=259&annotation=VLKFEA9Q)) ^953405305
- When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower. ([Page 260](zotero://open-pdf/library/items/NUWYWJMI?page=259&annotation=QJH9HWV6)) ^953405306
- For example, you could track the time of the last update and, for one minute after the last update, make all reads from the leader. You could also monitor the replication lag on followers and prevent queries on any follower that is more than one minute behind the leader. ([Page 260](zotero://open-pdf/library/items/NUWYWJMI?page=259&annotation=BTVQRRRE)) ^953405307
- The client can remember the timestamp of its most recent write—then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp. ([Page 260](zotero://open-pdf/library/items/NUWYWJMI?page=259&annotation=G7MBX5C6)) ^953405308
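A sketch of that timestamp-based routing under simplified assumptions: each replica reports a logical timestamp up to which it has applied replicated writes, and the client refuses any replica that lags behind its own last write. The function and replica names are illustrative.

```python
def choose_replica(replicas, last_write_ts):
    """replicas: list of (name, replicated_up_to_ts) pairs.
    Return a replica that has caught up to the client's last write,
    falling back to the leader, which is always up to date."""
    for name, replicated_up_to in replicas:
        if replicated_up_to >= last_write_ts:
            return name
    return "leader"

replicas = [("follower-1", 95), ("follower-2", 103)]
fresh = choose_replica(replicas, last_write_ts=100)   # follower-2 is fresh enough
```

Note the timestamp here can be a logical sequence number rather than wall-clock time, which sidesteps clock-skew problems.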
- When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go backward—i.e., they will not read older data after having previously read newer data. ([Page 264](zotero://open-pdf/library/items/NUWYWJMI?page=263&annotation=D72UAK9Y)) ^953405309
- Since many implementations of multi-leader replication handle conflicts quite poorly, avoiding conflicts is a frequently recommended approach [34]. ([Page 276](zotero://open-pdf/library/items/NUWYWJMI?page=275&annotation=TEBPHLYS)) ^953405310
- Some CRDTs have been implemented in Riak 2.0 [39, 40]. ([Page 279](zotero://open-pdf/library/items/NUWYWJMI?page=278&annotation=SSPJV9HN)) ^953405311
- Mergeable persistent data structures [41] track history explicitly, similarly to the Git version control system, and use a three-way merge function (whereas CRDTs use two-way merges). ([Page 279](zotero://open-pdf/library/items/NUWYWJMI?page=278&annotation=QB4U7GDL)) ^953405312
- Operational transformation [42] is the conflict resolution algorithm behind collaborative editing applications such as Etherpad [30] and Google Docs [31]. ([Page 279](zotero://open-pdf/library/items/NUWYWJMI?page=278&annotation=ECLI5LPV)) ^953405313
- Note that without an anti-entropy process, values that are rarely read may be missing from some replicas and thus have reduced durability, because read repair is only performed when a value is read by the application. ([Page 288](zotero://open-pdf/library/items/NUWYWJMI?page=287&annotation=3VJTRPGW)) ^953405314
- More generally, if there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. (In our example, n = 3, w = 2, r = 2.) As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes [44]. ([Page 288](zotero://open-pdf/library/items/NUWYWJMI?page=287&annotation=EJWSTM4U)) ^953405315
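The arithmetic behind the quorum condition: if w + r > n, the set of nodes written to and the set of nodes read from must overlap in at least w + r − n nodes, so at least one queried node holds the latest write.

```python
def quorum_overlap(n, w, r):
    """Minimum number of nodes guaranteed to be in both a successful
    write's acknowledgment set (size w) and a read's query set (size r),
    out of n replicas. Positive iff w + r > n."""
    return max(0, w + r - n)

# The book's example configuration: n=3, w=2, r=2 overlaps in >= 1 node.
overlap = quorum_overlap(3, 2, 2)
```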
- From an operational perspective, it’s important to monitor whether your databases are returning up-to-date results. Even if your application can tolerate stale reads, you need to be aware of the health of your replication. ([Page 293](zotero://open-pdf/library/items/NUWYWJMI?page=292&annotation=9KEYP2ZN)) ^953405316
- Sloppy quorums are optional in all common Dynamo implementations. In Riak they are enabled by default, and in Cassandra and Voldemort they are disabled by default [46, 49, 50]. ([Page 295](zotero://open-pdf/library/items/NUWYWJMI?page=294&annotation=YJGRXCPK)) ^953405317
- If losing data is not acceptable, LWW is a poor choice for conflict resolution. ([Page 299](zotero://open-pdf/library/items/NUWYWJMI?page=298&annotation=CLXS45EW)) ^953405318
- The collection of version numbers from all the replicas is called a version vector [56]. A few variants of this idea are in use, but the most interesting is probably the dotted version vector [57], which is used in Riak 2.0 [58, 59]. ([Page 309](zotero://open-pdf/library/items/NUWYWJMI?page=308&annotation=LRCP4949)) ^953405319
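A toy plain version vector to show the core operations (illustrative; the dotted version vectors used in Riak 2.0 add more machinery). Each entry maps a replica ID to a per-replica counter; comparing vectors distinguishes "happened before" from "concurrent".

```python
def merge(a, b):
    """Pointwise maximum: the vector after both histories have been seen."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def descends(a, b):
    """True if vector a has observed everything in b's causal history."""
    return all(a.get(k, 0) >= count for k, count in b.items())

v1 = {"replica_a": 2, "replica_b": 1}
v2 = {"replica_a": 1, "replica_b": 3}

# Neither descends from the other, so the two writes are concurrent
# and a sibling/merge is needed rather than silently overwriting.
concurrent = not descends(v1, v2) and not descends(v2, v1)
```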
- If the partitioning is unfair, so that some partitions have more data or queries than others, we call it skewed. ([Page 323](zotero://open-pdf/library/items/NUWYWJMI?page=322&annotation=DLZG6QGL)) ^953405320
- For partitioning purposes, the hash function need not be cryptographically strong: for example, MongoDB uses MD5, Cassandra uses Murmur3, and Voldemort uses the Fowler–Noll–Vo function. ([Page 327](zotero://open-pdf/library/items/NUWYWJMI?page=326&annotation=ZP3A2NSW)) ^953405321
- Consistent hashing, as defined by Karger et al. [7], is a way of evenly distributing load across an internet-wide system of caches such as a content delivery network (CDN). It uses randomly chosen partition boundaries to avoid the need for central control or distributed consensus. ([Page 329](zotero://open-pdf/library/items/NUWYWJMI?page=328&annotation=BG6QARI3)) ^953405322
- As we shall see in “Rebalancing Partitions”, this particular approach actually doesn’t work very well for databases [8], so it is rarely used in practice (the documentation of some databases still refers to consistent hashing, but it is often inaccurate) ([Page 329](zotero://open-pdf/library/items/NUWYWJMI?page=328&annotation=FF67H3X8)) ^953405323
#caveat
- In this indexing approach, each partition is completely separate: each partition maintains its own secondary indexes, covering only the documents in that partition. It doesn’t care what data is stored in other partitions. ([Page 334](zotero://open-pdf/library/items/NUWYWJMI?page=333&annotation=X8NSF677)) ^953405324
- Nevertheless, it is widely used: MongoDB, Riak [15], Cassandra [16], Elasticsearch [17], SolrCloud [18], and VoltDB [19] all use document-partitioned secondary indexes. ([Page 334](zotero://open-pdf/library/items/NUWYWJMI?page=333&annotation=FNCJM7SN)) ^953405325
- Rather than each partition having its own secondary index (a local index), we can construct a global index that covers data in all partitions. ([Page 336](zotero://open-pdf/library/items/NUWYWJMI?page=335&annotation=8Z7FP254)) ^953405326
It's like maintaining a separate table that serves as the secondary index.
- However, the downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node). ([Page 336](zotero://open-pdf/library/items/NUWYWJMI?page=335&annotation=ZJ7RK8E7)) ^953405327
- In practice, updates to global secondary indexes are often asynchronous (that is, if you read the index shortly after a write, the change you just made may not yet be reflected in the index). ([Page 337](zotero://open-pdf/library/items/NUWYWJMI?page=336&annotation=4JE86RJ5)) ^953405328
- The problem with the mod N approach is that if the number of nodes N changes, most of the keys will need to be moved from one node to another. ([Page 338](zotero://open-pdf/library/items/NUWYWJMI?page=337&annotation=ZQSINSST)) ^953405329
- Fortunately, there is a fairly simple solution: create many more partitions than there are nodes, and assign several partitions to each node. ([Page 339](zotero://open-pdf/library/items/NUWYWJMI?page=338&annotation=9TLCDCFH)) ^953405330
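A quick simulation contrasting the two approaches (key count, partition count, and hash function are my own illustrative choices). With hash mod N, adding one node remaps most keys; with many fixed partitions, a key's partition never changes and only whole partitions migrate between nodes.

```python
import hashlib

def stable_hash(key):
    """A process-independent hash (Python's built-in hash() is salted)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

keys = [f"key-{i}" for i in range(10_000)]

# Approach 1: assign key to node via hash mod N, then grow from 4 to 5 nodes.
moved = sum(1 for k in keys if stable_hash(k) % 4 != stable_hash(k) % 5)
share_moved = moved / len(keys)     # roughly 4/5 of all keys must move

# Approach 2: many fixed partitions (e.g. 1024), several per node.
# Rebalancing reassigns partitions to nodes; key -> partition stays fixed.
partition_of = {k: stable_hash(k) % 1024 for k in keys}
```

For uniformly distributed hashes, a key stays put under mod 4 vs. mod 5 only when the two remainders coincide (about 1 in 5 keys), which is why the moved share comes out near 80%.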
- A third option, used by Cassandra and Ketama, is to make the number of partitions proportional to the number of nodes—in other words, to have a fixed number of partitions per node [23, 27, 28]. ([Page 343](zotero://open-pdf/library/items/NUWYWJMI?page=342&annotation=V2ECLF4T)) ^953405331
- Fully automated rebalancing can be convenient, because there is less operational work to do for normal maintenance. However, it can be unpredictable. ([Page 344](zotero://open-pdf/library/items/NUWYWJMI?page=343&annotation=IT8PE32P)) ^953405332
- Many distributed data systems rely on a separate coordination service such as ZooKeeper to keep track of this cluster metadata, as illustrated in Figure 6-8. ([Page 347](zotero://open-pdf/library/items/NUWYWJMI?page=346&annotation=QM8IF4EG)) ^953405333
- Cassandra and Riak take a different approach: they use a gossip protocol among the nodes to disseminate any changes in cluster state. Requests can be sent to any node, and that node forwards them to the appropriate node for the requested partition (approach 1 in Figure 6-7). ([Page 349](zotero://open-pdf/library/items/NUWYWJMI?page=348&annotation=YJQWWKVQ)) ^953405334
- A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback). ([Page 357](zotero://open-pdf/library/items/NUWYWJMI?page=356&annotation=FVJ63RBW)) ^953405335
#definition
- Almost all relational databases today, and some nonrelational databases, support transactions. Most of them follow the style that was introduced in 1975 by IBM System R, the first SQL database [1, 2, 3]. ([Page 358](zotero://open-pdf/library/items/NUWYWJMI?page=357&annotation=FEDPB5BQ)) ^953405336
- The safety guarantees provided by transactions are often described by the well-known acronym ACID, which stands for Atomicity, Consistency, Isolation, and Durability. It was coined in 1983 by Theo Härder and Andreas Reuter [7] in an effort to establish precise terminology for fault-tolerance mechanisms in databases. ([Page 359](zotero://open-pdf/library/items/NUWYWJMI?page=358&annotation=Q83KYMN7)) ^953405337
- The high-level idea is sound, but the devil is in the details. Today, when a system claims to be “ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term. ([Page 359](zotero://open-pdf/library/items/NUWYWJMI?page=358&annotation=7RBM4N2X)) ^953405338
- Systems that do not meet the ACID criteria are sometimes called BASE, which stands for Basically Available, Soft state, and Eventual consistency [9]. ([Page 360](zotero://open-pdf/library/items/NUWYWJMI?page=359&annotation=6Z8ZPUYU)) ^953405339
- The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity. ([Page 361](zotero://open-pdf/library/items/NUWYWJMI?page=360&annotation=XL4XH5QY)) ^953405340
#definition
- The idea of ACID consistency is that you have certain statements about your data (invariants) that must always be true—for example, in an accounting system, credits and debits across all accounts must always be balanced. If a transaction starts with a database that is valid according to these invariants, and any writes during the transaction preserve the validity, then you can be sure that the invariants are always satisfied. ([Page 361](zotero://open-pdf/library/items/NUWYWJMI?page=360&annotation=39PHG9NF)) ^953405341
#definition
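The invariant idea from the highlight above can be illustrated in a few lines (a toy, application-level example; the account names and amounts are made up):

```python
def transfer(accounts: dict, src: str, dst: str, amount: int) -> None:
    """Preserve the invariant that the total across all accounts is constant:
    either both writes happen, or (on error) neither does."""
    if accounts[src] < amount:
        raise ValueError("insufficient funds")  # abort before any write
    accounts[src] -= amount
    accounts[dst] += amount

accounts = {"alice": 100, "bob": 50}
total = sum(accounts.values())
transfer(accounts, "alice", "bob", 30)
assert sum(accounts.values()) == total  # the application preserved validity
```

Note that the database cannot check this invariant for you; it only guarantees that if every transaction preserves it, the invariant holds globally — which is why consistency here is a property of the application.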
- Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. ([Page 362](zotero://open-pdf/library/items/NUWYWJMI?page=361&annotation=EBJXQW7H)) ^953405342
- Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other: they cannot step on each other’s toes. The classic database textbooks formalize isolation as serializability, which means that each transaction can pretend that it is the only transaction running on the entire database. ([Page 362](zotero://open-pdf/library/items/NUWYWJMI?page=361&annotation=JFM2E5MV)) ^953405343
#definition
- In Oracle there is an isolation level called “serializable,” but it actually implements something called snapshot isolation, which is a weaker guarantee than serializability [8, 11]. ([Page 364](zotero://open-pdf/library/items/NUWYWJMI?page=363&annotation=RRBBC76E)) ^953405344
- Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes. ([Page 364](zotero://open-pdf/library/items/NUWYWJMI?page=363&annotation=RNGL6N6H)) ^953405345
#definition
- In practice, there is no one technique that can provide absolute guarantees. There are only various risk-reduction techniques, including writing to disk, replicating to remote machines, and backups—and they can and should be used together. As always, it’s wise to take any theoretical “guarantees” with a healthy grain of salt. ([Page 366](zotero://open-pdf/library/items/NUWYWJMI?page=365&annotation=L3U9R6R5)) ^953405346
- Atomicity If an error occurs halfway through a sequence of writes, the transaction should be aborted, and the writes made up to that point should be discarded. In other words, the database saves you from having to worry about partial failure, by giving an all-or-nothing guarantee. ([Page 366](zotero://open-pdf/library/items/NUWYWJMI?page=365&annotation=WXCCDWPD)) ^953405347
#definition
- Isolation Concurrently running transactions shouldn’t interfere with each other. For example, if one transaction makes several writes, then another transaction should see either all or none of those writes, but not some subset. ([Page 366](zotero://open-pdf/library/items/NUWYWJMI?page=365&annotation=QZQCBGHI)) ^953405348
#definition
- Atomicity can be implemented using a log for crash recovery (see “Making B-trees reliable”), and isolation can be implemented using a lock on each object (allowing only one thread to access an object at any one time). ([Page 371](zotero://open-pdf/library/items/NUWYWJMI?page=370&annotation=7MJA5HVT)) ^953405349
- Compare-and-set and other single-object operations have been dubbed “lightweight transactions” or even “ACID” for marketing purposes [20, 21, 22], but that terminology is misleading. A transaction is usually understood as a mechanism for grouping multiple operations on multiple objects into one unit of execution. ([Page 372](zotero://open-pdf/library/items/NUWYWJMI?page=371&annotation=2MR56S5L)) ^953405350
- Even many popular relational database systems (which are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these bugs from occurring. ([Page 376](zotero://open-pdf/library/items/NUWYWJMI?page=375&annotation=2JRJ6SNA)) ^953405351
- Our discussion of isolation levels will be informal, using examples. If you want rigorous definitions and analyses of their properties, you can find them in the academic literature [28, 29, 30]. ([Page 376](zotero://open-pdf/library/items/NUWYWJMI?page=375&annotation=ST45LJ4S)) ^953405352
- Transactions running at the read committed isolation level must prevent dirty reads. This means that any writes by a transaction only become visible to others when that transaction commits (and then all of its writes become visible at once). ([Page 377](zotero://open-pdf/library/items/NUWYWJMI?page=376&annotation=ZE7H7C84)) ^953405353
- Transactions running at the read committed isolation level must prevent dirty writes, usually by delaying the second write until the first write’s transaction has committed or aborted. ([Page 379](zotero://open-pdf/library/items/NUWYWJMI?page=378&annotation=WV2CQFEE)) ^953405354
- Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to modify a particular object (row or document), it must first acquire a lock on that object. It must then hold that lock until the transaction is committed or aborted. ([Page 382](zotero://open-pdf/library/items/NUWYWJMI?page=381&annotation=BVPVSQC9)) ^953405355
- For that reason, most databases prevent dirty reads using the approach illustrated in Figure 7-4: for every object that is written, the database remembers both the old committed value and the new value set by the transaction that currently holds the write lock. ([Page 382](zotero://open-pdf/library/items/NUWYWJMI?page=381&annotation=LPK8NZDD)) ^953405356
- Snapshot isolation [28] is the most common solution to this problem. The idea is that each transaction reads from a consistent snapshot of the database—that is, the transaction sees all the data that was committed in the database at the start of the transaction. ([Page 386](zotero://open-pdf/library/items/NUWYWJMI?page=385&annotation=QPMV6BC7)) ^953405357
- To implement snapshot isolation, databases use a generalization of the mechanism we saw for preventing dirty reads in Figure 7-4. The database must potentially keep several different committed versions of an object, because various in-progress transactions may need to see the state of the database at different points in time. Because it maintains several versions of an object side by side, this technique is known as multi-version concurrency control (MVCC). ([Page 387](zotero://open-pdf/library/items/NUWYWJMI?page=386&annotation=GARN58U9)) ^953405358
- An update is internally translated into a delete and a create. For example, in Figure 7-7, transaction 13 deducts $100 from account 2, changing the balance from $500 to $400. The accounts table now actually contains two rows for account 2: a row with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of $400 which was created by transaction 13. ([Page 389](zotero://open-pdf/library/items/NUWYWJMI?page=388&annotation=RQ23MLNP)) ^953405359
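The visibility rule behind MVCC can be sketched for the book's account example (a simplified model with illustrative field names; real databases track in-progress transactions more carefully):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: int
    created_by: int             # txid that inserted this version
    deleted_by: Optional[int]   # txid that deleted it, or None

def visible(v: Version, reader_txid: int, in_progress: set) -> bool:
    """Simplified snapshot visibility: a version is visible if its creator
    committed before the reader's snapshot, and it was not deleted by a
    transaction that also committed before the snapshot."""
    created_ok = v.created_by <= reader_txid and v.created_by not in in_progress
    deleted = (v.deleted_by is not None
               and v.deleted_by <= reader_txid
               and v.deleted_by not in in_progress)
    return created_ok and not deleted

# Account 2 from the book's example: txid 13 changes the balance 500 -> 400,
# leaving two rows side by side.
account2 = [
    Version(value=500, created_by=5, deleted_by=13),
    Version(value=400, created_by=13, deleted_by=None),
]

# A transaction that started before 13 still sees $500:
old = [v.value for v in account2 if visible(v, reader_txid=12, in_progress=set())]
# A later transaction sees $400:
new = [v.value for v in account2 if visible(v, reader_txid=14, in_progress=set())]
print(old, new)  # [500] [400]
```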
- With append-only B-trees, every write transaction (or batch of transactions) creates a new B-tree root, and a particular root is a consistent snapshot of the database at the point in time when it was created. There is no need to filter out objects based on transaction IDs because subsequent writes cannot modify an existing B-tree; ([Page 391](zotero://open-pdf/library/items/NUWYWJMI?page=390&annotation=96NCPHUH)) ^953405360
- Atomic operations and locks are ways of preventing lost updates by forcing the read-modify-write cycles to happen sequentially. An alternative is to allow them to execute in parallel and, if the transaction manager detects a lost update, abort the transaction and force it to retry its read-modify-write cycle. ([Page 394](zotero://open-pdf/library/items/NUWYWJMI?page=393&annotation=C4E8JKGS)) ^953405361
- An advantage of this approach is that databases can perform this check efficiently in conjunction with snapshot isolation. ([Page 395](zotero://open-pdf/library/items/NUWYWJMI?page=394&annotation=2KFHBTXR)) ^953405362
- However, if the database allows the WHERE clause to read from an old snapshot, this statement may not prevent lost updates ([Page 395](zotero://open-pdf/library/items/NUWYWJMI?page=394&annotation=I6R7HWPR)) ^953405363
#caveat
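The detect-and-retry approach to lost updates can be sketched with a compare-and-set loop (a minimal in-memory stand-in for the database-side check; the `Register` class and its methods are illustrative, not from the book):

```python
import threading

class Register:
    """A tiny register supporting atomic compare-and-set."""
    def __init__(self, value: int):
        self._value = value
        self._lock = threading.Lock()

    def read(self) -> int:
        with self._lock:
            return self._value

    def compare_and_set(self, expected: int, new: int) -> bool:
        with self._lock:
            if self._value != expected:
                return False  # someone else wrote in between: caller retries
            self._value = new
            return True

def increment(reg: Register) -> None:
    """A read-modify-write cycle made safe against lost updates:
    on CAS failure, re-read and retry instead of clobbering."""
    while True:
        old = reg.read()
        if reg.compare_and_set(old, old + 1):
            return

counter = Register(0)
threads = [threading.Thread(target=lambda: [increment(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.read())  # 4000 — no increments lost
```

The caveat in the highlight above applies here too: the check is only safe if `compare_and_set` compares against the current value, not a value read from an old snapshot.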
- You can think of write skew as a generalization of the lost update problem. Write skew can occur if two transactions read the same objects, and then update some of those objects (different transactions may update different objects). ([Page 399](zotero://open-pdf/library/items/NUWYWJMI?page=398&annotation=PNXFSSST)) ^953405364
- Automatically preventing write skew requires true serializable isolation (see “Serializability”). ([Page 400](zotero://open-pdf/library/items/NUWYWJMI?page=399&annotation=46KNI25W)) ^953405365
- If you can’t use a serializable isolation level, the second-best option in this case is probably to explicitly lock the rows that the transaction depends on. ([Page 400](zotero://open-pdf/library/items/NUWYWJMI?page=399&annotation=FKR28BVB)) ^953405366
- This effect, where a write in one transaction changes the result of a search query in another transaction, is called a phantom [3]. ([Page 403](zotero://open-pdf/library/items/NUWYWJMI?page=402&annotation=KKA529JD)) ^953405367
- If the problem of phantoms is that there is no object to which we can attach the locks, perhaps we can artificially introduce a lock object into the database? ([Page 403](zotero://open-pdf/library/items/NUWYWJMI?page=402&annotation=JGWH83LL)) ^953405368
- This approach is called materializing conflicts, because it takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database [11]. ([Page 404](zotero://open-pdf/library/items/NUWYWJMI?page=403&annotation=8LT5HY4Q)) ^953405369
- Serializable isolation is usually regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency. ([Page 405](zotero://open-pdf/library/items/NUWYWJMI?page=404&annotation=GCFCK7PD)) ^953405370
- The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic [46, 47, 48]. A system designed for single-threaded execution can sometimes perform better than a system that supports concurrency, because it can avoid the coordination overhead of locking. ([Page 406](zotero://open-pdf/library/items/NUWYWJMI?page=405&annotation=7H2HM84F)) ^953405371
- If a transaction needs to use the current date and time, for example, it must do so through special deterministic APIs. ([Page 411](zotero://open-pdf/library/items/NUWYWJMI?page=410&annotation=RIV9537A)) ^953405372
- For around 30 years, there was only one widely used algorithm for serializability in databases: two-phase locking (2PL). ([Page 412](zotero://open-pdf/library/items/NUWYWJMI?page=411&annotation=GDQSAB5L)) ^953405373
- In 2PL, writers don’t just block other writers; they also block readers and vice versa. ([Page 413](zotero://open-pdf/library/items/NUWYWJMI?page=412&annotation=MHT3N9KJ)) ^953405374
- The lock can either be in shared mode or in exclusive mode. ([Page 413](zotero://open-pdf/library/items/NUWYWJMI?page=412&annotation=ZWS4LUFL)) ^953405375
read-write lock
- Conceptually, we need a predicate lock [3]. It works similarly to the shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one row in a table), it belongs to all objects that match some search condition ([Page 416](zotero://open-pdf/library/items/NUWYWJMI?page=415&annotation=U7335YDY)) ^953405376
- The key idea here is that a predicate lock applies even to objects that do not yet exist in the database, but which might be added in the future (phantoms). ([Page 416](zotero://open-pdf/library/items/NUWYWJMI?page=415&annotation=ZZ54ZMBN)) ^953405377
- It’s safe to simplify a predicate by making it match a greater set of objects. ([Page 417](zotero://open-pdf/library/items/NUWYWJMI?page=416&annotation=2WLFBZ5S)) ^953405378
- If there is no suitable index where a range lock can be attached, the database can fall back to a shared lock on the entire table. ([Page 418](zotero://open-pdf/library/items/NUWYWJMI?page=417&annotation=FLEE99ZK)) ^953405379
- Perhaps not: an algorithm called serializable snapshot isolation (SSI) is very promising. It provides full serializability, but has only a small performance penalty compared to snapshot isolation. SSI is fairly new: it was first described in 2008 [40] and is the subject of Michael Cahill’s PhD thesis [51]. ([Page 418](zotero://open-pdf/library/items/NUWYWJMI?page=417&annotation=CQAJHN7G)) ^953405380
#starred
- Today SSI is used both in single-node databases (the serializable isolation level in PostgreSQL since version 9.1 [41]) and distributed databases (FoundationDB uses a similar algorithm). ([Page 418](zotero://open-pdf/library/items/NUWYWJMI?page=417&annotation=9QPH8GGV)) ^953405381
- By contrast, serializable snapshot isolation is an optimistic concurrency control technique. Optimistic in this context means that instead of blocking if something potentially dangerous happens, transactions continue anyway, in the hope that everything will turn out all right. When a transaction wants to commit, the database checks whether anything bad happened (i.e., whether isolation was violated); if so, the transaction is aborted and has to be retried. ([Page 419](zotero://open-pdf/library/items/NUWYWJMI?page=418&annotation=C8JHMR5L)) ^953405382
- SSI adds an algorithm for detecting serialization conflicts among writes and determining which transactions to abort. ([Page 420](zotero://open-pdf/library/items/NUWYWJMI?page=419&annotation=7MZH46UG)) ^953405383
- In order to prevent this anomaly, the database needs to track when a transaction ignores another transaction’s writes due to MVCC visibility rules. When the transaction wants to commit, the database checks whether any of the ignored writes have now been committed. If so, the transaction must be aborted. ([Page 423](zotero://open-pdf/library/items/NUWYWJMI?page=422&annotation=GM8GBYE4)) ^953405384
- When a transaction writes to the database, it must look in the indexes for any other transactions that have recently read the affected data. This process is similar to acquiring a write lock on the affected key range, but rather than blocking until the readers have committed, the lock acts as a tripwire: it simply notifies the transactions that the data they read may no longer be up to date. ([Page 425](zotero://open-pdf/library/items/NUWYWJMI?page=424&annotation=BMG579N6)) ^953405385
- FoundationDB distributes the detection of serialization conflicts across multiple machines, allowing it to scale to very high throughput. Even though data may be partitioned across multiple machines, transactions can read and write data in multiple partitions while ensuring serializable isolation [54]. ([Page 426](zotero://open-pdf/library/items/NUWYWJMI?page=425&annotation=8BZEU7JU)) ^953405386
- It found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause of outages. ([Page 447](zotero://open-pdf/library/items/NUWYWJMI?page=446&annotation=NRWXX3FV)) ^953405387
- This can be done with a Phi Accrual failure detector [30], which is used for example in Akka and Cassandra [31]. ([Page 455](zotero://open-pdf/library/items/NUWYWJMI?page=454&annotation=GP53MSN2)) ^953405388
- Variable delays in networks are not a law of nature, but simply the result of a cost/benefit trade-off. ([Page 459](zotero://open-pdf/library/items/NUWYWJMI?page=458&annotation=AINP6F9R)) ^953405389
- One experiment showed that a minimum error of 35 ms is achievable when synchronizing over the internet [42], though occasional spikes in network delay lead to errors of around a second. Depending on the configuration, large network delays can cause the NTP client to give up entirely. ([Page 463](zotero://open-pdf/library/items/NUWYWJMI?page=462&annotation=HX9WF4LE)) ^953405390
- When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds while another VM is running. From an application’s point of view, this pause manifests itself as the clock suddenly jumping forward [26]. ([Page 463](zotero://open-pdf/library/items/NUWYWJMI?page=462&annotation=5X9WG9IG)) ^953405391
- Such accuracy can be achieved using GPS receivers, the Precision Time Protocol (PTP) [52], and careful deployment and monitoring. ([Page 464](zotero://open-pdf/library/items/NUWYWJMI?page=463&annotation=H82Q5GLA)) ^953405392
- An interesting exception is Google’s TrueTime API in Spanner [41], which explicitly reports the confidence interval on the local clock. ([Page 469](zotero://open-pdf/library/items/NUWYWJMI?page=468&annotation=P74XFDH2)) ^953405393
- In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction. By doing so, it ensures that any transaction that may read the data is at a sufficiently later time, so their confidence intervals do not overlap. ([Page 471](zotero://open-pdf/library/items/NUWYWJMI?page=470&annotation=TINCXDM5)) ^953405394
- A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function. ([Page 474](zotero://open-pdf/library/items/NUWYWJMI?page=473&annotation=8MIBIZFJ)) ^953405395
- Moreover, “realtime” is not the same as “high-performance”—in fact, real-time systems may have lower throughput, since they have to prioritize timely responses above all else (see also “Latency and Resource Utilization”). ([Page 476](zotero://open-pdf/library/items/NUWYWJMI?page=475&annotation=8FBHIRJM)) ^953405396
- A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover. ([Page 479](zotero://open-pdf/library/items/NUWYWJMI?page=478&annotation=Y9GYSFZI)) ^953405397
- Let’s assume that every time the lock server grants a lock or lease, it also returns a fencing token, which is a number that increases every time a lock is granted (e.g., incremented by the lock service). We can then require that every time a client sends a write request to the storage service, it must include its current fencing token. ([Page 484](zotero://open-pdf/library/items/NUWYWJMI?page=483&annotation=THH278WL)) ^953405398
- If ZooKeeper is used as lock service, the transaction ID zxid or the node version cversion can be used as fencing token. Since they are guaranteed to be monotonically increasing, they have the required properties [74]. ([Page 484](zotero://open-pdf/library/items/NUWYWJMI?page=483&annotation=25VW7QSM)) ^953405399
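The storage-side check described above is simple enough to sketch (illustrative class and method names; the book describes the idea, not this API):

```python
class StorageService:
    """Rejects writes carrying a fencing token older than one already seen,
    so a client holding an expired lease cannot clobber newer writes."""
    def __init__(self):
        self.highest_token_seen = -1
        self.data = {}

    def write(self, key: str, value: str, token: int) -> bool:
        if token < self.highest_token_seen:
            return False  # stale token: the lock has since been granted again
        self.highest_token_seen = token
        self.data[key] = value
        return True

storage = StorageService()
assert storage.write("file", "from client 1", token=33)    # lease 33 granted first
assert storage.write("file", "from client 2", token=34)    # client 1's lease expired
assert not storage.write("file", "stale write", token=33)  # old token rejected
print(storage.data["file"])  # from client 2
```

The token only needs to be monotonically increasing, which is why ZooKeeper's `zxid` or `cversion` fit the role.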
- For modeling real systems, the partially synchronous model with crashrecovery faults is generally the most useful model. ([Page 491](zotero://open-pdf/library/items/NUWYWJMI?page=490&annotation=596WCPTZ)) ^953405400
- Safety is often informally defined as nothing bad happens, and liveness as something good eventually happens. ([Page 492](zotero://open-pdf/library/items/NUWYWJMI?page=491&annotation=L5NXE63E)) ^953405401
- Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama: “The φ Accrual Failure Detector,” Japan Advanced Institute of Science and Technology, School of Information Science, Technical Report IS-RR-2004-010, May 2004. ([Page 499](zotero://open-pdf/library/items/NUWYWJMI?page=498&annotation=4TRBEZ9C)) ^953405402
#further-reading
- In a linearizable system we imagine that there must be some point in time (between the start and end of the write operation) at which the value of x atomically flips from 0 to 1. Thus, if one client’s read returns the new value 1, all subsequent reads must also return the new value, even if the write operation has not yet completed. ([Page 517](zotero://open-pdf/library/items/NUWYWJMI?page=516&annotation=TUXUKG2B)) ^953405403
- In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability. ([Page 531](zotero://open-pdf/library/items/NUWYWJMI?page=530&annotation=AMX7WVE5)) ^953405404
- At times when the network is working correctly, a system can provide both consistency (linearizability) and total availability. When a network fault occurs, you have to choose between either linearizability or total availability. ([Page 536](zotero://open-pdf/library/items/NUWYWJMI?page=535&annotation=UPZXIPRT)) ^953405405
- Thus, a better way of phrasing CAP would be either Consistent or Available when Partitioned [39]. ([Page 536](zotero://open-pdf/library/items/NUWYWJMI?page=535&annotation=Y87DCZ96)) ^953405406
- For example, even RAM on a modern multi-core CPU is not linearizable [43]: if a thread running on one CPU core writes to a memory address, and a thread on another CPU core reads the same address shortly afterward, it is not guaranteed to read the value written by the first thread (unless a memory barrier or fence [44] is used). ([Page 537](zotero://open-pdf/library/items/NUWYWJMI?page=536&annotation=DVVANQWQ)) ^953405407
- Linearizability is slow—and this is true all the time, not only during a network fault. ([Page 537](zotero://open-pdf/library/items/NUWYWJMI?page=536&annotation=G3LH8FE7)) ^953405408
- Attiya and Welch [47] prove that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. ([Page 537](zotero://open-pdf/library/items/NUWYWJMI?page=536&annotation=JGUIC9ME)) ^953405409
- If a system obeys the ordering imposed by causality, we say that it is causally consistent. ([Page 541](zotero://open-pdf/library/items/NUWYWJMI?page=540&annotation=XI5PPENS)) ^953405410
- In fact, causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures [2, 42]. ([Page 543](zotero://open-pdf/library/items/NUWYWJMI?page=542&annotation=4NZPZDPK)) ^953405411
- The Lamport timestamp is then simply a pair of (counter, node ID). Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp, each timestamp is made unique. ([Page 547](zotero://open-pdf/library/items/NUWYWJMI?page=546&annotation=HFC3FRB8)) ^953405412
- When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum. ([Page 549](zotero://open-pdf/library/items/NUWYWJMI?page=548&annotation=GVRWK9MH)) ^953405413
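The two rules from the highlights above — unique (counter, node ID) pairs, and jumping the counter forward on receipt — fit in a few lines (a minimal sketch, not any production implementation):

```python
class LamportClock:
    """Lamport timestamps as (counter, node_id) pairs, compared counter-first
    so ties between nodes are broken deterministically by node ID."""
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Local event or message send: increment and stamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, ts):
        """On receiving a timestamp, jump to max(local, received) and increment."""
        self.counter = max(self.counter, ts[0]) + 1
        return (self.counter, self.node_id)

a, b = LamportClock(1), LamportClock(2)
t1 = a.tick()       # (1, 1)
t2 = a.tick()       # (2, 1)
t3 = b.receive(t2)  # (3, 2): b jumps past a's counter
assert t1 < t2 < t3  # total order consistent with causality
```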
- This approach works for determining the winner after the fact: once you have collected all the username creation operations in the system, you can compare their timestamps. However, this approach is not sufficient when a node has just received a request from a user to create a username, and needs to decide right now whether the request should succeed or fail. ([Page 550](zotero://open-pdf/library/items/NUWYWJMI?page=549&annotation=UXEESWI5)) ^953405414
Similar to the issue in CKB of resolving conflicts via off-chain indexing consensus, where the cell created earliest wins.
- To conclude: in order to implement something like a uniqueness constraint for usernames, it’s not sufficient to have a total ordering of operations—you also need to know when that order is finalized. ([Page 551](zotero://open-pdf/library/items/NUWYWJMI?page=550&annotation=IIDUXQUQ)) ^953405415
- In the distributed systems literature, this problem is known as total order broadcast or atomic broadcast [25, 57, 58]. ([Page 551](zotero://open-pdf/library/items/NUWYWJMI?page=550&annotation=XI2IPFN9)) ^953405416
- In general, if you think hard enough about linearizable sequence number generators, you inevitably end up with a consensus algorithm. ([Page 556](zotero://open-pdf/library/items/NUWYWJMI?page=555&annotation=9BFHEX6V)) ^953405417
- This is no coincidence: it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus [28, 67]. ([Page 556](zotero://open-pdf/library/items/NUWYWJMI?page=555&annotation=5SWD3IIG)) ^953405418
- Once the coordinator’s decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds. There is no more going back: if the decision was to commit, that decision must be enforced, no matter how many retries it takes. ([Page 565](zotero://open-pdf/library/items/NUWYWJMI?page=564&annotation=C8MY9FSI)) ^953405419
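The coordinator's "no going back" retry loop can be sketched as follows (an illustrative shape, assuming a `send` transport callback that returns True on acknowledgment — not an API from the book):

```python
import time

def broadcast_decision(participants, decision, send,
                       initial_backoff=1.0, max_backoff=60.0):
    """After the coordinator has durably logged `decision`, push it to every
    participant, retrying forever until each one acknowledges."""
    pending = set(participants)
    backoff = initial_backoff
    while pending:
        for p in list(pending):
            if send(p, decision):
                pending.discard(p)  # this participant has learned the outcome
        if pending:
            time.sleep(backoff)  # back off before retrying the stragglers
            backoff = min(backoff * 2, max_backoff)
```

The crucial property is that the loop never gives up and never changes its mind: the decision is already on disk, so a crash simply means the loop resumes from the log on recovery.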
- X/Open XA (short for eXtended Architecture) is a standard for implementing two-phase commit across heterogeneous technologies [76, 77]. It was introduced in 1991 and has been widely implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL, DB2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ). ([Page 571](zotero://open-pdf/library/items/NUWYWJMI?page=570&annotation=EPW7I6G8)) ^953405420
- The transaction coordinator implements the XA API. The standard does not specify how it should be implemented, but in practice the coordinator is often simply a library that is loaded into the same process as the application issuing the transaction (not a separate service). ([Page 571](zotero://open-pdf/library/items/NUWYWJMI?page=570&annotation=6HS5IV5J)) ^953405421
- The uniform agreement and integrity properties define the core idea of consensus: everyone decides on the same outcome, and once you have decided, you cannot change your mind. ([Page 576](zotero://open-pdf/library/items/NUWYWJMI?page=575&annotation=K9JGFPZY)) ^953405422
- The validity property exists mostly to rule out trivial solutions: for example, you could have an algorithm that always decides null, no matter what was proposed; this algorithm would satisfy the agreement and integrity properties, but not the validity property. ([Page 576](zotero://open-pdf/library/items/NUWYWJMI?page=575&annotation=IB8IMNH2)) ^953405423
- The termination property formalizes the idea of fault tolerance. It essentially says that a consensus algorithm cannot simply sit around and do nothing forever—in other words, it must make progress. ([Page 576](zotero://open-pdf/library/items/NUWYWJMI?page=575&annotation=EYV7C2YX)) ^953405424
- The best-known fault-tolerant consensus algorithms are Viewstamped Replication (VSR) [94, 95], Paxos [96, 97, 98, 99], Raft [22, 100, 101], and Zab [15, 21, 102]. ([Page 577](zotero://open-pdf/library/items/NUWYWJMI?page=576&annotation=I38K6VQS)) ^953405425
- total order broadcast is equivalent to repeated rounds of consensus ([Page 578](zotero://open-pdf/library/items/NUWYWJMI?page=577&annotation=TNQEYYVG)) ^953405426
- The process by which nodes vote on proposals before they are decided is a kind of synchronous replication. ([Page 581](zotero://open-pdf/library/items/NUWYWJMI?page=580&annotation=36FGSR4G)) ^953405427
- Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft has been shown to have unpleasant edge cases [106]: if the entire network is working correctly except for one particular network link that is consistently unreliable, Raft can get into situations where leadership continually bounces between two nodes, or the current leader is continually forced to resign, so the system effectively never makes progress. ([Page 582](zotero://open-pdf/library/items/NUWYWJMI?page=581&annotation=QBZ6QSIV)) ^953405428
#caveat
- Christian Cachin, Rachid Guerraoui, and Luís Rodrigues: Introduction to Reliable and Secure Distributed Programming, 2nd edition. Springer, 2011. ISBN: 978-3-642-15259-7, doi:10.1007/978-3-642-15260-3 ([Page 593](zotero://open-pdf/library/items/NUWYWJMI?page=592&annotation=XGRTQH53)) ^953405429
#future-reading
- Miguel Castro and Barbara H. Liskov: “Practical Byzantine Fault Tolerance and Proactive Recovery,” ACM Transactions on Computer Systems, volume 20, number 4, pages 396–461, November 2002. doi:10.1145/571637.571640 ([Page 597](zotero://open-pdf/library/items/NUWYWJMI?page=596&annotation=AYWNEV5B)) ^953405430
#future-reading
- Replication may mean simply several copies of the same data on multiple machines, as in Chapter 5, or an erasure coding scheme such as Reed–Solomon codes, which allows lost data to be recovered with lower storage overhead than full replication [20, 22]. ([Page 617](zotero://open-pdf/library/items/NUWYWJMI?page=616&annotation=5Q7G2KSN)) ^953405431
- Avro (see “Avro”) and Parquet (see “Column-Oriented Storage”) are often used, as they provide efficient schema-based encoding and allow evolution of their schemas over time (see Chapter 4). ([Page 641](zotero://open-pdf/library/items/NUWYWJMI?page=640&annotation=L88GERSL)) ^953405432
- And this is why MapReduce is designed to tolerate frequent unexpected task termination: it’s not because the hardware is particularly unreliable, it’s because the freedom to arbitrarily terminate processes enables better resource utilization in a computing cluster. ([Page 646](zotero://open-pdf/library/items/NUWYWJMI?page=645&annotation=D4DWFTLQ)) ^953405433
- In general, a “stream” refers to data that is incrementally made available over time. ([Page 673](zotero://open-pdf/library/items/NUWYWJMI?page=672&annotation=62MGYZVL)) ^953405434
- Why can we not have a hybrid, combining the durable storage approach of databases with the low-latency notification facilities of messaging? This is the idea behind log-based message brokers. ([Page 685](zotero://open-pdf/library/items/NUWYWJMI?page=684&annotation=I5F6P33N)) ^953405435
Time-series Database
- Apache Kafka [17, 18], Amazon Kinesis Streams [19], and Twitter’s DistributedLog [20, 21] are log-based message brokers that work like this. ([Page 687](zotero://open-pdf/library/items/NUWYWJMI?page=686&annotation=GGBU72PW)) ^953405436
#future-reference
- More recently, there has been growing interest in change data capture (CDC), which is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to other systems. CDC is especially interesting if changes are made available as a stream, immediately as they are written. ([Page 696](zotero://open-pdf/library/items/NUWYWJMI?page=695&annotation=MWRQEIZ6)) ^953405437
- LinkedIn’s Databus [25], Facebook’s Wormhole [26], and Yahoo!’s Sherpa [27] use this idea at large scale. Bottled Water implements CDC for PostgreSQL using an API that decodes the write-ahead log [28], Maxwell and Debezium do something similar for MySQL by parsing the binlog [29, 30, 31], Mongoriver reads the MongoDB oplog [32, 33], and GoldenGate provides similar facilities for Oracle [34, 35]. The Kafka Connect framework offers further CDC connectors for various databases. ([Page 698](zotero://open-pdf/library/items/NUWYWJMI?page=697&annotation=HYCH9HA5)) ^953405438
#future-reference
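The consumer side of CDC, common to all the tools listed above, can be sketched as replaying an ordered change stream into a derived store. The event format here is invented for illustration:

```python
# Sketch of a CDC consumer: an ordered stream of change events from a
# source database is applied, one by one, to a derived key-value store,
# keeping the replica in sync. The event schema is illustrative.

def apply_change(store: dict, event: dict) -> None:
    """Apply one change event ('upsert' or 'delete') to a derived store."""
    if event["op"] == "upsert":
        store[event["key"]] = event["value"]
    elif event["op"] == "delete":
        store.pop(event["key"], None)

change_stream = [
    {"op": "upsert", "key": "user:1", "value": "Alice"},
    {"op": "upsert", "key": "user:2", "value": "Bob"},
    {"op": "upsert", "key": "user:1", "value": "Alicia"},  # later write wins
    {"op": "delete", "key": "user:2", "value": None},
]

replica = {}
for event in change_stream:
    apply_change(replica, event)
print(replica)   # {'user:1': 'Alicia'}
```

Ordering is what makes this correct: applying the same events in a different order could leave the replica in a state the source never had.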
- Kafka Connect [41] is an effort to integrate change data capture tools for a wide range of database systems with Kafka. ([Page 701](zotero://open-pdf/library/items/NUWYWJMI?page=700&annotation=LLK5TYCU)) ^953405439
#future-reference
- The event sourcing philosophy is careful to distinguish between events and commands [48]. ([Page 703](zotero://open-pdf/library/items/NUWYWJMI?page=702&annotation=MK8V9VX4)) ^953405440
- A consumer of the event stream is not allowed to reject an event: by the time the consumer sees the event, it is already an immutable part of the log, and it may have already been seen by other consumers. ([Page 704](zotero://open-pdf/library/items/NUWYWJMI?page=703&annotation=ZK473SXK)) ^953405441
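The two highlights above fit together: a command is a request that is validated and may be rejected, but once accepted and appended to the log it is an immutable event that downstream consumers simply apply. The seat-reservation scenario below is invented for illustration:

```python
# Sketch of the command/event distinction in event sourcing: validation
# happens on the command; once an event is in the log it is a fact that
# consumers cannot reject.

event_log = []                    # append-only; entries here are facts
seats_available = {"A1", "A2"}

def handle_command(seat: str) -> bool:
    """Validate a 'reserve seat' command; only valid commands become events."""
    if seat not in seats_available:
        return False              # rejected before any event is written
    seats_available.discard(seat)
    event_log.append({"type": "seat_reserved", "seat": seat})
    return True

assert handle_command("A1") is True
assert handle_command("A1") is False   # duplicate command is rejected
print(event_log)   # [{'type': 'seat_reserved', 'seat': 'A1'}]
```

All the contention is handled at command time; consumers of `event_log` never need to re-validate.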
- Pistachio is a distributed key-value store that uses Kafka as a commit log [56], and Kafka Connect sinks can export data from Kafka to various different databases and indexes [41]. ([Page 708](zotero://open-pdf/library/items/NUWYWJMI?page=707&annotation=4DD6DR8P)) ^953405442
#future-reference
- Storing data is normally quite straightforward if you don’t have to worry about how it is going to be queried and accessed; many of the complexities of schema design, indexing, and storage engines are the result of wanting to support certain query and access patterns (see Chapter 3). ([Page 708](zotero://open-pdf/library/items/NUWYWJMI?page=707&annotation=ZJATJTNT)) ^953405443
- This idea is sometimes known as command query responsibility segregation (CQRS) [42, 58, 59]. ([Page 708](zotero://open-pdf/library/items/NUWYWJMI?page=707&annotation=2T3BKCGQ)) ^953405444
#future-reference
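A minimal sketch of CQRS, assuming an in-memory event log as the write model and a query-shaped aggregate as the read model (all names are illustrative):

```python
# Sketch of CQRS: writes append to an event log (the write model), while
# read models are derived from the log and shaped for specific queries.

events = []   # write model: the log of what happened

def record_order(customer: str, amount: float) -> None:
    events.append({"customer": customer, "amount": amount})

def totals_by_customer() -> dict:
    """A read model derived from the log, optimized for this one query."""
    view = {}
    for e in events:
        view[e["customer"]] = view.get(e["customer"], 0.0) + e["amount"]
    return view

record_order("alice", 10.0)
record_order("bob", 5.0)
record_order("alice", 2.5)
print(totals_by_customer())   # {'alice': 12.5, 'bob': 5.0}
```

In a real system the read model would be maintained incrementally and stored (e.g. in a database or cache) rather than recomputed per query; this sketch only shows the separation of responsibilities.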
- Deletion is more a matter of “making it harder to retrieve the data” than actually “making it impossible to retrieve the data.” ([Page 711](zotero://open-pdf/library/items/NUWYWJMI?page=710&annotation=28ELNVAQ)) ^953405445
- a stream processor consumes input streams in a read-only fashion and writes its output to a different location in an append-only fashion. ([Page 712](zotero://open-pdf/library/items/NUWYWJMI?page=711&annotation=SE5VDRVC)) ^953405446
- Complex event processing (CEP) is an approach developed in the 1990s for analyzing event streams, especially geared toward the kind of application that requires searching for certain event patterns [65, 66]. ([Page 713](zotero://open-pdf/library/items/NUWYWJMI?page=712&annotation=FYZIKATL)) ^953405447
- We can regard these examples as specific cases of maintaining materialized views (see “Aggregation: Data Cubes and Materialized Views”): deriving an alternative view onto some dataset so that you can query it efficiently, and updating that view whenever the underlying data changes [50]. ([Page 715](zotero://open-pdf/library/items/NUWYWJMI?page=714&annotation=Q92BELYV)) ^953405448
- Conventional search engines first index the documents and then run queries over the index. By contrast, searching a stream turns the processing on its head: the queries are stored, and the documents run past the queries, like in CEP. ([Page 716](zotero://open-pdf/library/items/NUWYWJMI?page=715&annotation=PK8SFK29)) ^953405449
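The inverted arrangement, with queries stored up front and each passing document matched against all of them, can be sketched as follows (query names and matching logic are illustrative):

```python
# Sketch of searching a stream: queries are registered first, and every
# arriving document runs past the stored queries, the inverse of a
# conventional index-then-query search engine.

stored_queries = {
    "realestate": {"house", "sale"},
    "jobs": {"hiring", "engineer"},
}

def match_document(text: str) -> list:
    """Names of stored queries whose terms all appear in the document."""
    words = set(text.lower().split())
    return [name for name, terms in stored_queries.items() if terms <= words]

print(match_document("House for sale in Cambridge"))    # ['realestate']
print(match_document("We are hiring a data engineer"))  # ['jobs']
```

Elasticsearch's percolator feature works on this principle, though with a real query language rather than term sets.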
- The difference to batch jobs is that a batch job uses a point-in-time snapshot of the database as input, whereas a stream processor is long-running, and the contents of the database are likely to change over time, so the stream processor’s local copy of the database needs to be kept up to date. ([Page 726](zotero://open-pdf/library/items/NUWYWJMI?page=725&annotation=W84C8UFK)) ^953405450
- One solution is to break the stream into small blocks, and treat each block like a miniature batch process. This approach is called microbatching, and it is used in Spark Streaming [91]. ([Page 729](zotero://open-pdf/library/items/NUWYWJMI?page=728&annotation=7GLY79VH)) ^953405451
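A minimal sketch of microbatching; note that Spark Streaming cuts blocks by time (typically around a second), while this sketch cuts by record count for simplicity:

```python
# Sketch of microbatching: an unbounded stream is split into small
# fixed-size blocks, and each block is processed as a miniature batch job.

def microbatches(stream, batch_size):
    """Yield successive fixed-size blocks of the stream."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                     # final partial block
        yield batch

events = [1, 2, 3, 4, 5, 6, 7]
results = [sum(b) for b in microbatches(events, 3)]   # one batch job per block
print(results)   # [6, 15, 7]
```

The batch size trades latency (smaller blocks finish sooner) against per-batch scheduling overhead.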
- A variant approach, used in Apache Flink, is to periodically generate rolling checkpoints of state and write them to durable storage [92, 93]. ([Page 730](zotero://open-pdf/library/items/NUWYWJMI?page=729&annotation=UNDVMHJY)) ^953405452
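A sketch of the rolling-checkpoint idea: operator state is periodically copied to durable storage, and recovery restarts from the last checkpoint rather than the beginning. Flink's real mechanism (checkpoint barriers flowing through the dataflow graph) is more involved; this shows only the gist, with invented names:

```python
# Sketch of periodic checkpointing: every few records, a deep copy of the
# operator state (including the input offset) is saved; after a crash,
# processing resumes from the checkpoint and reprocesses only the tail.

import copy

state = {"sum": 0, "last_offset": -1}
durable = None                    # stand-in for durable checkpoint storage

def process(records, checkpoint_every=3):
    global durable
    for offset, value in enumerate(records):
        if offset <= state["last_offset"]:
            continue              # already processed before the crash
        state["sum"] += value
        state["last_offset"] = offset
        if (offset + 1) % checkpoint_every == 0:
            durable = copy.deepcopy(state)    # rolling checkpoint

def recover():
    """After a crash, restore state from the last durable checkpoint."""
    global state
    state = copy.deepcopy(durable)

process([1, 2, 3, 4, 5])
recover()                  # simulate a crash: fall back to the checkpoint
process([1, 2, 3, 4, 5])   # only the records after the checkpoint rerun
print(state["sum"])        # 15
```

Checkpointing the input offset together with the state is what makes replay idempotent here.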
- Most consensus algorithms are designed for situations in which the throughput of a single node is sufficient to process the entire stream of events, and these algorithms do not provide a mechanism for multiple nodes to share the work of ordering the events. ([Page 750](zotero://open-pdf/library/items/NUWYWJMI?page=749&annotation=T7UYUNBG)) ^953405453
- In the lambda approach, the stream processor consumes the events and quickly produces an approximate update to the view; the batch processor later consumes the same set of events and produces a corrected version of the derived view. ([Page 755](zotero://open-pdf/library/items/NUWYWJMI?page=754&annotation=CL5NLDKS)) ^953405454
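The two paths of the lambda approach over the same event log can be sketched as follows; treating late-arriving events as the source of approximation is one illustrative choice, not the book's definition:

```python
# Sketch of the lambda architecture: the stream path maintains a fast,
# approximate view incrementally, while the batch path later recomputes
# an exact view from the full, immutable set of events.

events = [
    {"user": "alice", "late": False},
    {"user": "alice", "late": True},   # arrived after its window closed
    {"user": "bob",   "late": False},
]

def stream_view(events):
    """Fast path: incremental counts, ignoring awkward late arrivals."""
    view = {}
    for e in events:
        if not e["late"]:
            view[e["user"]] = view.get(e["user"], 0) + 1
    return view

def batch_view(events):
    """Slow path: exact recomputation over all events."""
    view = {}
    for e in events:
        view[e["user"]] = view.get(e["user"], 0) + 1
    return view

print(stream_view(events))   # {'alice': 1, 'bob': 1}  (quick, approximate)
print(batch_view(events))    # {'alice': 2, 'bob': 1}  (corrected later)
```

The operational cost of lambda is visible even in this toy: the same counting logic is written twice, once per path, which is one of the book's criticisms of the approach.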