# Jay Kreps - The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction (Highlights)

![rw-book-cover|256](https://engineering.linkedin.com/apps/settings/wcm/designs/linkedin/katy/global/clientlibs/resources/img/default-share-twitter.png)

## Metadata
**Review**:: [readwise.io](https://readwise.io/bookreview/50757986)
**Source**:: #from/readwise #from/reader
**Zettel**:: #zettel/fleeting
**Status**:: #x
**Authors**:: [[Jay Kreps]]
**Full Title**:: The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction
**Category**:: #articles #readwise/articles
**Category Icon**:: 📰
**Document Tags**:: #shortlist
**URL**:: [engineering.linkedin.com](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)
**Host**:: [[engineering.linkedin.com]]
**Highlighted**:: [[2025-04-22]]
**Created**:: [[2025-04-23]]

## Highlights
- logs have a specific purpose: they record what happened and when ([View Highlight](https://read.readwise.io/read/01jscrpg446s6ngw0awpaaphfj)) ^880414647
- Amazon has offered a service that is very very similar to Kafka called [Kinesis](http://aws.amazon.com/kinesis). ([View Highlight](https://read.readwise.io/read/01jsd9cnhtwr8v3nqep6bxnmqy)) ^880482258
- The best model is to have *cleanup* done prior to publishing the data to the log by the publisher of the data. This means ensuring the data is in a canonical form and doesn't retain any hold-overs from the particular code that produced it or the storage system in which it may have been maintained. These details are best handled by the team that creates the data since they know the most about their own data. ([View Highlight](https://read.readwise.io/read/01jsd9rms8zxb4k2xs9j42k63r)) ^880483792
- The "event-driven" style provides an approach to simplifying this.
  The job display page now just shows a job and records the fact that a job was shown along with the relevant attributes of the job, the viewer, and any other useful facts about the display of the job. Each of the other interested systems—the recommendation system, the security system, the job poster analytics system, and the data warehouse—all just subscribe to the feed and do their processing. ([View Highlight](https://read.readwise.io/read/01jsd9ycnxk97qdnhqm75va9kh)) ^880487314
- At LinkedIn we are currently running over 60 billion unique message writes through Kafka per day (several hundred billion if you count the writes from [mirroring between datacenters](http://kafka.apache.org/documentation.html#datacenters)). ([View Highlight](https://read.readwise.io/read/01jsda089z6nyvexfejqb3efbt)) ^880488522
- Each partition is a totally ordered log, but there is no global ordering between partitions (other than perhaps some wall-clock time you might include in your messages). ([View Highlight](https://read.readwise.io/read/01jsda2r8977jbjep9760b48pe)) ^880488716
- A log, like a filesystem, is easy to optimize for linear read and write patterns. The log can group small reads and writes together into larger, high-throughput operations ([View Highlight](https://read.readwise.io/read/01jsda4n7bb2vms0xazyj7pkty)) ^880488774
- For those interested in more details, we have open sourced [Samza](http://samza.incubator.apache.org/), a stream processing system explicitly built on many of these ideas. ([View Highlight](https://read.readwise.io/read/01jsek4q74vjg1tmka8fxntvfk)) ^880671930 #future-reference
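The "cleanup prior to publishing" highlight can be sketched as a producer that normalizes its internal records into a canonical schema before appending them to the log, so no storage-specific hold-overs leak to consumers. All names here (`normalize_event`, `publish`, the field names, the `_shard` hold-over) are illustrative, not from the article:

```python
from datetime import datetime, timezone

def normalize_event(raw: dict) -> dict:
    """Map a producer-internal record to a canonical log schema,
    dropping storage-specific hold-overs before publishing."""
    return {
        "user_id": str(raw["uid"]),      # canonical field name and type
        "action": raw["action"].lower(),  # canonical casing
        "ts": datetime.fromtimestamp(raw["epoch_ms"] / 1000,
                                     tz=timezone.utc).isoformat(),
    }

log = []  # stand-in for the shared log (e.g. a Kafka topic)

def publish(raw: dict) -> None:
    # Only canonical records ever reach the log; the "_shard" hold-over
    # from the producer's own storage system is deliberately dropped.
    log.append(normalize_event(raw))

publish({"uid": 42, "action": "VIEW", "epoch_ms": 1_700_000_000_000, "_shard": 7})
```

The point of the quote is ownership: the publishing team writes `normalize_event`, because they know their own data best.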
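The event-driven highlight describes one producer recording a "job shown" fact while many independent systems subscribe to the feed. A minimal in-memory sketch of that decoupling (the `Log` class and all record fields are hypothetical stand-ins, not LinkedIn's actual API):

```python
class Log:
    """Append-only log; each subscriber tracks its own read offset."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        return self.records[offset:]

job_views = Log()

# The display page publishes one event; it knows nothing about consumers.
job_views.append({"job_id": 101, "viewer": "alice", "page": "job_display"})
job_views.append({"job_id": 102, "viewer": "bob", "page": "job_display"})

# Each interested system (recommendations, security, analytics, warehouse)
# consumes the same feed independently, at its own pace.
recommendations_seen = [r["job_id"] for r in job_views.read_from(0)]
security_events = [r for r in job_views.read_from(0) if r["viewer"] == "bob"]
```

Adding a new downstream system is then just another subscriber at offset 0; the display page is never touched again.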
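The partition highlight — total order within a partition, no global order across partitions — is usually exploited by routing messages by key, so that everything for one key lands in the same totally ordered partition. A sketch under that assumption (the article doesn't prescribe a routing function; a stable hash like CRC32 is used here for illustration):

```python
import zlib

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

def append(key: str, message: str) -> int:
    """Route by key: every message for one key shares a single
    totally ordered partition; there is no order across partitions."""
    p = zlib.crc32(key.encode()) % NUM_PARTITIONS
    partitions[p].append(message)
    return p

# Both events for "user-a" hash to the same partition,
# so their relative order is preserved for any consumer.
p1 = append("user-a", "login")
p2 = append("user-a", "logout")
```

Events for different keys may land in different partitions, where only something like an embedded wall-clock timestamp could relate them — exactly the caveat in the highlight.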
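The linear-I/O highlight notes that a log can group many small writes into larger, high-throughput operations. A toy batching writer showing the idea (class name and `batch_size` threshold are assumptions; real systems typically flush on size *or* time):

```python
class BatchingWriter:
    """Buffer small appends and flush them as one large write,
    mimicking how a log turns many small writes into linear I/O."""
    def __init__(self, sink, batch_size=4):
        self.sink = sink            # destination for the large writes
        self.batch_size = batch_size
        self.buffer = []

    def append(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.append(list(self.buffer))  # one large sequential write
            self.buffer.clear()

writes = []  # each element here represents one physical write
w = BatchingWriter(writes, batch_size=3)
for i in range(7):
    w.append(i)
w.flush()  # drain the final partial batch
```

Seven logical appends become three physical writes — the trade being a little latency for much higher throughput.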