- Intro
- Assorted Performance Insights
- Use a Better Algorithm First
- Intern/Avoid Strings When Possible
- String Interning
- Reducing string operations
- Caches Dominate
- An Example
- Flame Graphs are Awesome
- Flame Graph Example:
- Iterators are fun
- Benchmarking is Good
- Letting Go & Moving On
Intro
For those that don’t know, Advent of Code is described as:
an Advent calendar of small programming puzzles for a variety of skill sets and skill levels that can be solved in any programming language you like. People use them as a speed contest, interview prep, company training, university coursework, practice problems, or to challenge each other.
Essentially, 25 two-part programming puzzles are released, one per day, starting on December 1st every year. People compete to solve them the fastest or in novel ways. I started a bit too late (March 2022) to compete on finding a solution first.
However, I did decide to complete Advent of Code 2021 for a couple reasons:
- I’d never really tried competitive programming
- I wanted to get more comfortable with Rust.
- Puzzles are fun
After solving all the problems, I found that what I really enjoyed most was improving the performance of each solution, as well as the cumulative run-time of the full solution set.
Ultimately, I ended up at ~370 ms without Day 19 and ~4.5 s with Day 19. Had I not pulled the plug on optimization, I would probably have felt satisfied somewhere around ~200 ms total (Day 19 included).
- I operated with some implicit constraints (only one external crate, used sparingly; solutions should be relatively intuitive; no exploiting specific input characteristics too heavily).
Below, I’ll discuss some interesting and specific performance-related tidbits I came away with, as well as the reasons I eventually decided to stop optimizing. You can see my solutions here.
Favorite puzzle: Day 24. This was a really cool puzzle that used a series of obfuscated commands to simulate a stack. A very nice aha moment.
Assorted Performance Insights
Use a Better Algorithm First
As has been echoed repeatedly elsewhere, the best starting point is the overall algorithm. There are many cases where whole classes of problems can be obviated just by doing something a better way. Doing less work from the outset makes everything a lot better.
To give an example, on Day 18 I initially rushed in and implemented a very awkward tree representation, using an Interior Mutability pattern (many reference-counting pointers, i.e. Rc<RefCell<T>>) and node parent/child references.
- A quick search shows 28 Rc::clone() calls, ouch.
Replacing this with a straightforward array-backed Binary Tree implementation immediately resulted in a >10x gain.
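For a sense of what the replacement looks like, an array-backed binary tree is just index arithmetic over a flat Vec. A minimal sketch (the Day 18-specific node payload and the fixed depth here are illustrative, not my exact code):

```rust
// Array-backed binary tree sketch: node i's children live at 2*i + 1 and
// 2*i + 2, so traversal needs no pointers and no Rc<RefCell<T>>.
struct ArrayTree {
    nodes: Vec<Option<u32>>, // None = empty slot; payload type is illustrative
}

impl ArrayTree {
    fn with_depth(depth: u32) -> Self {
        ArrayTree {
            nodes: vec![None; (1usize << (depth + 1)) - 1],
        }
    }

    fn left(i: usize) -> usize { 2 * i + 1 }
    fn right(i: usize) -> usize { 2 * i + 2 }
    fn parent(i: usize) -> usize { (i - 1) / 2 }
}
```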
Intern/Avoid Strings When Possible
I had an aversion to strings based on past experience and gut instinct, so it was nice to see some confirmation from other sources, e.g. Why the Sorbet typechecker is fast.
Basically, we want to perform as few operations as possible on strings themselves, and replace them with something else wherever we can via interning.
String Interning
I first came across the term for this concept in Why the Sorbet typechecker is fast. Essentially, we create a consistent mapping from known string values to something else, like unsigned 8-bit integers, which are much better for performance in every way.
In Rust, a simple string interner, along the lines of what I used (with some tweaks), looks roughly like this:
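```rust
use std::collections::HashMap;

// Illustrative sketch of the usual map-plus-vec interner; field and method
// names are placeholders rather than the exact code from my solutions.
#[derive(Default)]
struct Interner {
    map: HashMap<String, u32>,
    vec: Vec<String>,
}

impl Interner {
    /// Returns a stable u32 id for `name`, interning it on first sight.
    fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.map.get(name) {
            return id;
        }
        let id = self.vec.len() as u32;
        self.map.insert(name.to_owned(), id);
        self.vec.push(name.to_owned());
        id
    }

    /// Maps an id back to the original string.
    fn lookup(&self, id: u32) -> &str {
        &self.vec[id as usize]
    }
}
```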
In my case I was able to drop the vec and convert the HashMap to a HashSet.
Reducing string operations
This is generally pretty mechanically trivial but can be surprisingly effective if you decide to pay attention. I’ll show three approaches in the context of a trivial solution (Day 2 Part 2) below:
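The first pass was roughly along these lines (an illustrative sketch, not the exact code behind the benchmark below):

```rust
// Day 2 Part 2, string-heavy version: collect each line into a Vec of &str,
// compare whole command strings, and parse the amount with str::parse.
fn solve_naive(input: &str) -> i64 {
    let (mut horizontal, mut depth, mut aim) = (0i64, 0i64, 0i64);
    for line in input.lines() {
        let parts: Vec<&str> = line.split(' ').collect();
        let amount: i64 = parts[1].parse().unwrap();
        match parts[0] {
            "forward" => {
                horizontal += amount;
                depth += aim * amount;
            }
            "down" => aim += amount,
            "up" => aim -= amount,
            _ => unreachable!(),
        }
    }
    horizontal * depth
}
```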
Benchmark time: 212.16 µs
A less redundant approach:
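Again as a sketch of the idea (not the exact benchmarked code): skip the intermediate Vec and only look at the first character of each command:

```rust
// Day 2 Part 2, trimmed down: no per-line Vec allocation, and the command is
// distinguished by its first character instead of a full string comparison.
fn solve_less_redundant(input: &str) -> i64 {
    let (mut horizontal, mut depth, mut aim) = (0i64, 0i64, 0i64);
    for line in input.lines() {
        let mut parts = line.split(' ');
        let cmd = parts.next().unwrap();
        let amount: i64 = parts.next().unwrap().parse().unwrap();
        match cmd.chars().next().unwrap() {
            'f' => {
                horizontal += amount;
                depth += aim * amount;
            }
            'd' => aim += amount,
            'u' => aim -= amount,
            _ => unreachable!(),
        }
    }
    horizontal * depth
}
```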
Benchmark time: 47.933 µs, a ~78.5% improvement!
An even more stripped-down approach:
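A byte-level sketch of the same idea (illustrative; it assumes single-digit amounts, which holds for the actual input):

```rust
// Day 2 Part 2, byte-oriented: dispatch on the first byte of each line and
// read the amount as (last byte - 48), i.e. an ASCII digit minus '0'.
fn solve_bytes(input: &str) -> i64 {
    let (mut horizontal, mut depth, mut aim) = (0i64, 0i64, 0i64);
    for line in input.lines() {
        // bytes() instead of chars(): no UTF-8 decoding, just raw u8s.
        let amount = (line.bytes().last().unwrap() - 48) as i64;
        match line.bytes().next().unwrap() {
            b'f' => {
                horizontal += amount;
                depth += aim * amount;
            }
            b'd' => aim += amount,
            b'u' => aim -= amount,
            _ => unreachable!(),
        }
    }
    horizontal * depth
}
```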
Benchmark: 34.101 µs, a ~26.1% improvement on our second approach, 6x faster than the initial approach.
bytes() is an easy drop-in replacement for chars() for a small performance gain; 48 is the ASCII value of ‘0’.
Caches Dominate
Caches are extremely important to performance. To quote from the sacred texts:
Modern hardware is incredibly sensitive to hit rates on CPU caches, and so our core data structures were designed… to maximize cache locality.
Caches are generally very small (e.g. the Intel i7-4770’s L1 instruction cache is 32 KB), so in cases where we’re dealing with many references/values, reducing size matters, e.g. u64 -> u8.
CPU caches also operate on cache lines, typically 64-byte chunks of memory around the accessed address, as an optimization to reduce total round-trips to main memory.
An Example
A common approach to enhance cache locality and memory access patterns is flattening arrays.
In the context of Rust, a Vec consists of a (pointer, capacity, length) triplet. This means that if we operate on a 2D Vec (Vec<Vec<T>>), each inner Vec necessitates a pointer-follow to memory that is relatively less likely to be contiguous.
Thus, it’s generally better to use a one-dimensional Vec (or a 2D array) and handle the access logic yourself, necessitating only one pointer-follow.
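A minimal sketch of the flattened layout (type and method names are illustrative):

```rust
// Flattened 2D grid: one contiguous allocation, one pointer-follow, and the
// (x, y) -> index mapping handled manually as y * width + x.
struct Grid<T> {
    width: usize,
    cells: Vec<T>,
}

impl<T: Copy> Grid<T> {
    fn get(&self, x: usize, y: usize) -> T {
        self.cells[y * self.width + x]
    }

    fn set(&mut self, x: usize, y: usize, value: T) {
        self.cells[y * self.width + x] = value;
    }
}
```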
So, for Advent of Code Day 15 Part 2, which involves a 250k-element 2D grid, shifting from a 2D Vec to a 1D Vec resulted in a >40% benchmarked improvement.
Related:
- Array2D provides a fixed sized two-dimensional array. It is more efficient…
- Grids in Rust, part 1: nested vs. flat Vecs
Flame Graphs are Awesome
Optimizing before profiling leads to wasted time. Performance bottlenecks can be very unintuitive.
Using [cargo-]flamegraph to generate flame graphs with dtrace was massively useful. There were so many cases where my first instinct was to perform a holistic analysis beginning at the algorithmic level, when execution time was essentially dominated by one operation in a hot code path.
Flame Graph Example:
Day 15 Part 2: a simple Dijkstra’s solution over a 2D grid (represented by a 1D Vec).
Can only move up, down, left, right. Start: top-left, Goal: bottom-right:
This benchmarks at: ~45ms
Just looking at this it’s not immediately clear where the lowest-hanging fruit is, although there are some suspicious bits.
Let’s generate a flame graph.
Drilling down into the flame graph shows that there are 3 offenders:
- alloc::collections::binary_heap::BinaryHeap<T>::pop (21.28%)
- alloc::raw_vec::RawVec<T,A>::reserve_for_push (14.89%)
- libsystem_malloc.dylib`free (10.64%)
Let’s leave the Binary Heap pop() alone for the moment, since it’s a bit more tedious to work around. We can see that allocating memory for a Vec and calling free are, combined, taking up more time than anything else.
Here’s our culprit:
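In spirit, it looked roughly like this (an illustrative sketch; the real version was a closure capturing the grid dimensions):

```rust
// Rough shape of the offender: it builds a brand-new Vec of neighbour indices
// on every call, i.e. a heap allocation per visited node in the hottest loop.
fn gen_adj_list(idx: usize, width: usize, height: usize) -> Vec<usize> {
    let (x, y) = (idx % width, idx / width);
    let mut adj = Vec::new();
    if x > 0 { adj.push(idx - 1); }
    if x + 1 < width { adj.push(idx + 1); }
    if y > 0 { adj.push(idx - width); }
    if y + 1 < height { adj.push(idx + width); }
    adj
}
```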
The gen_adj_list() closure is initializing and pushing up to four elements to a Vec for every single iteration of the loop that is our hottest code path.
Since we know that there can only be up to 4 adjacent elements, we can refactor the closure to re-populate the same array every loop iteration.
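A sketch of that refactor: write into a reusable four-element buffer and return how many slots are valid, so the hot loop never touches the allocator:

```rust
// Instead of returning a fresh Vec, fill a caller-provided [usize; 4] and
// report how many entries were written; callers iterate over adj[..n].
fn fill_adj(idx: usize, width: usize, height: usize, adj: &mut [usize; 4]) -> usize {
    let (x, y) = (idx % width, idx / width);
    let mut n = 0;
    if x > 0 { adj[n] = idx - 1; n += 1; }
    if x + 1 < width { adj[n] = idx + 1; n += 1; }
    if y > 0 { adj[n] = idx - width; n += 1; }
    if y + 1 < height { adj[n] = idx + width; n += 1; }
    n
}
```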
New Benchmark: ~22 ms (> 2x faster)
One caveat: profilers like dtrace, which are based on sampling applications at fixed rates (often every few ms), become difficult to utilize for long-tail optimization.
Other examples where flame graphs helped were exposing HashMap operations that were dominating execution time, prompting me to use my single external crate, fnv (Fowler–Noll–Vo hash function; more efficient for smaller hash keys).
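For reference, swapping in fnv is close to a drop-in change (the key/value types and function here are illustrative):

```rust
// FnvHashMap is std's HashMap with the FNV hasher plugged in, so the API is
// unchanged; construct it with Default instead of HashMap::new().
use fnv::FnvHashMap;

fn count_occurrences(values: &[u32]) -> FnvHashMap<u32, u32> {
    let mut counts: FnvHashMap<u32, u32> = FnvHashMap::default();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
}
```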
Iterators are fun
This is not exactly Rust-specific, but iterators in general are very fun to compose; check out this pretty arbitrary sum based on adjacent three-element sliding windows:
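Something along these lines (an illustrative reconstruction rather than the exact snippet):

```rust
// Sum each three-element sliding window, then count how many window sums are
// larger than the previous one.
fn count_increasing_window_sums(depths: &[u32]) -> usize {
    depths
        .windows(3)
        .map(|w| w.iter().sum::<u32>())
        .collect::<Vec<_>>()
        .windows(2)
        .filter(|pair| pair[1] > pair[0])
        .count()
}
```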
They’re incidentally also a nice thing to keep in mind when working to reduce allocations (do you really need that .collect()?).
I played around with some really verbose compositions, but noticed that the performance benefits dropped off pretty fast, sadly.
Benchmarking is Good
It’s kind of embarrassing how long it took me to set up basic benchmarking for individual solutions and for the entire solution set. As measurements become more granular, isolating the signal from the noise is increasingly important. I chose to use criterion.
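For reference, a minimal criterion bench harness looks roughly like this (solve and the inline input are placeholders for a real solution and puzzle input):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Placeholder standing in for an actual solution function.
fn solve(input: &str) -> i64 {
    input.lines().count() as i64
}

fn bench_day02(c: &mut Criterion) {
    let input = "forward 5\ndown 5\nforward 8\nup 3";
    c.bench_function("day 02 part 2", |b| b.iter(|| solve(black_box(input))));
}

criterion_group!(benches, bench_day02);
criterion_main!(benches);
```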
Letting Go & Moving On
Up until now, I’ve focused on the technical aspects of performance work and the takeaways I had as someone who had never worked on anything like this before. However, this write-up wouldn’t be complete without my overall impression of, and reflection on, the process.
Ultimately, I did come to feel that, at a certain point, optimizing problems just for the sake of watching a number go down was a bit lackluster. Yes, you can start packing every conceivable thing into some kind of bit set, but at a certain point it feels very detached from any real-world impact. The rabbit hole goes on forever: for every great solution, there’s always another level.
At one point, I fell into a routine where, when I was bored, I would run my little ./exec_all.sh script (cargo bench takes longer due to warmup) and pick a top-3 runtime problem to go optimize. It’s intrinsically satisfying to watch numbers go down, sometimes by 10x, but it’s also hard to ignore the nagging feeling in the back of your head that there’s no real point to doing any of this. I could’ve built some really cool stuff in the time I spent doing this (though I’m glad I tried it out).
I’m a product-first programmer: I want to make things that deliver value to other people, not obsess over code as some kind of abstract intellectual exercise detached from value creation. I think that’s reflected in what I’ve worked on (college search, automated tennis feedback, startup, etc.).
So, I consciously restrained myself from optimizing Day 19 and decided to call it on Advent of Code 2021. It was a fun experience, but I don’t expect that I’ll be back.