# Aleksey Kladov - It’s Not Always iCache (Highlights)

![rw-book-cover|256](https://readwise-assets.s3.amazonaws.com/static/images/article2.74d541386bbf.png)

## Metadata
**Cover**:: https://readwise-assets.s3.amazonaws.com/static/images/article2.74d541386bbf.png
**Source**:: #from/readwise
**Zettel**:: #zettel/fleeting
**Status**:: #x
**Authors**:: [[Aleksey Kladov]]
**Full Title**:: It’s Not Always iCache
**Category**:: #articles #readwise/articles
**Category Icon**:: 📰
**URL**:: [matklad.github.io](https://matklad.github.io//2021/07/10/its-not-always-icache.html)
**Host**:: [[matklad.github.io]]
**Highlighted**:: [[2021-07-25]]
**Created**:: [[2022-09-26]]

## Highlights
- On Linux, the best tool to quickly assess the performance of any program is perf stat. #performance #example #linux
  ```
  $ perf stat -e instructions,cycles,\
  L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,\
  L1-icache-loads,L1-icache-load-misses,cache-misses \
  ./always
  ```
- While perf takes the real data from the CPU, an alternative approach is to run the program in a simulated environment. That’s what the cachegrind tool does.
- Note that the number of times the CPU refers to the iCache should correspond to the number of instructions it executes.

### Conclusions
- Inlining might cause C to use more registers. This means that the prologue and epilogue grow additional push/pop instructions, which also use stack memory.
- Generalizing from the first point, if S is called in a loop or in an if, the compiler might hoist some instructions of S to before the branch, moving them from the cold path to the hot path.
- With more local variables and control flow in the stack frame to juggle, the compiler might accidentally pessimize the hot loop.
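
The conclusions above can be made concrete with a minimal Rust sketch. It is not code from the article: `handle_rare` (playing the role of S) and `sum_with_rare_case` (playing the role of C) are hypothetical names, and the workload is a stand-in. The idea is to keep a rarely taken path out of the caller's body so that its extra locals and control flow are less likely to affect the hot loop. Both attributes are hints that the compiler may weigh differently across versions and targets, so any effect should be checked with a measurement such as `perf stat`.

```rust
/// Hypothetical cold path (the callee "S" in the highlights).
/// `#[cold]` hints that calls are unlikely; `#[inline(never)]` asks the
/// compiler to keep this body out of the caller. Both are hints only.
#[cold]
#[inline(never)]
fn handle_rare(x: u64) -> u64 {
    // Stand-in for slow work that needs several extra locals.
    let mut acc = x;
    for i in 0..8 {
        acc = acc.rotate_left(5) ^ i;
    }
    acc
}

/// Hypothetical hot loop (the caller "C" in the highlights).
/// The common branch is a single wrapping add; the rare branch is a call.
fn sum_with_rare_case(data: &[u64]) -> u64 {
    let mut total: u64 = 0;
    for &x in data {
        if x == u64::MAX {
            // Rare case: handled out of line.
            total = total.wrapping_add(handle_rare(x));
        } else {
            // Common case: stays small and cheap.
            total = total.wrapping_add(x);
        }
    }
    total
}
```

Calling `sum_with_rare_case(&[1, 2, 3])` exercises only the common branch. To compare against an inlined variant, one could change the attributes on `handle_rare` to `#[inline(always)]` and rerun the same measurement on both builds.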