# Fabian Reinartz - Writing a Time Series Database From Scratch (Highlights) ![rw-book-cover|256](https://readwise-assets.s3.amazonaws.com/static/images/article3.5c705a01b476.png) ## Metadata **Cover**:: https://readwise-assets.s3.amazonaws.com/static/images/article3.5c705a01b476.png **Source**:: #from/readwise **Zettel**:: #zettel/fleeting **Status**:: #x **Authors**:: [[Fabian Reinartz]] **Full Title**:: Writing a Time Series Database From Scratch **Category**:: #articles #readwise/articles **Category Icon**:: 📰 **Document Tags**:: #database #performance **URL**:: [fabxc.org](https://fabxc.org/tsdb/) **Host**:: [[fabxc.org]] **Highlighted**:: [[2021-02-21]] **Created**:: [[2022-09-26]] ## Highlights - Each data point is a tuple of a timestamp and a value. - the timestamp is an integer and the value any number. - the blog series “Coding for SSDs” series is a an excellent resource #further-reading #rl [“Coding for SSDs” series](http://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/) - the only batches we get are vertical sets of data points across series. - ideally, samples for the same series would be stored sequentially so we can just scan through them with as few reads as possible. - There’s obviously a strong tension between the ideal pattern for writing collected data to disk and the layout that would be significantly more efficient for serving queries. - We create one file per time series that contains all of its samples in sequential order. - we batch up 1KiB chunks of samples for a series in memory and append those chunks to the individual files - It also enables incredibly efficient compression formats, based on the property that a given sample changes only very little with respect to the previous sample in the same series - Facebook’s paper on their Gorilla TSDB describes a similar chunk-based approach and introduces a compression format that reduces 16 byte samples to an average of 1.37 bytes. #further-reading #rl http://www.vldb.org/pvldb/vol8/p1816-teller.pdf - keeping a separate file for each series is troubling the V2 storage for various reasons: - We actually need a lot more files than the number of time series we are currently collecting data for. - several thousands of chunks per second are completed and ready to be persisted. - It’s infeasible to keep all files open for reads and writes. - Eventually, old data has to be deleted and data needs to be removed from the front of millions of files. - Chunks that are currently accumulating are only held in memory. - If the application crashes, data will be lost. - In the Prometheus context, we use the term series churn to describe that a set of time series becomes inactive - The current V2 storage of Prometheus has an index based on LevelDB for all series that are currently stored. - lacks a scalable way to combine results from different label selections. - We partition our horizontal dimension, i.e. the time space, into non-overlapping blocks. Each block acts as a fully independent database containing all time series data for its time window. - Every block of data is immutable. - all new data is written to an in-memory database - To prevent data loss, all incoming data is also written to a temporary write ahead log, - This layout allows us to fan out queries to all blocks relevant to the queried time range. - When completing a block, we can persist the data from our in-memory database by sequentially writing just a handful of larger files. - recent chunks, which are queried most, are always hot in memory. - we are also no longer bound to the fixed 1KiB chunk size to better align data on disk. - Deleting old data becomes extremely cheap and instantaneous. - meta.json file. It simply holds human-readable information about the block - a handful of larger allows us to keep all files open with little overhead. This unblocks the usage of mmap(2) - keeping each block reasonably short - avoid accumulating too much data in memory - When querying multiple blocks, we have to merge their results into an overall result. - Compaction describes the process of taking one or more blocks of data and writing them into a, potentially larger, block. - we need a more capable inverted index.