- Intro
- Assorted Performance Insights
- Use a Better Algorithm First
- Intern/Avoid Strings When Possible
- String Interning
- Reducing string operations
- Caches Dominate
- An Example
- Flame Graphs are Awesome
- Flame Graph Example:
- Iterators are fun
- Benchmarking is Good
- Letting Go & Moving On
Intro
For those that don’t know, Advent of Code is described as:
an Advent calendar of small programming puzzles for a variety of skill sets and skill levels that can be solved in any programming language you like. People use them as a speed contest, interview prep, company training, university coursework, practice problems, or to challenge each other.
Essentially, 25 two-part programming puzzles are released, one per day, starting on December 1st every year. People compete to solve them the fastest or in novel ways. I started a bit too late (March 2022) to compete on finding a solution first.
However, I did decide to complete Advent of Code 2021 for a couple reasons:
- I’d never really tried competitive programming
- I wanted to get more comfortable with Rust.
- Puzzles are fun
After solving all the problems, I found that what I really enjoyed most was improving the performance of each solution, as well as the cumulative run-time of the full solution set.
Ultimately, I ended up at ~370 ms without Day 19 and ~4.5 s with Day 19. Had I not pulled the plug on optimization, I would probably have felt satisfied somewhere around ~200 ms total (Day 19 included).
- I operated with some implicit constraints (only one external crate, used sparingly; solutions should be relatively intuitive; no exploiting specific input characteristics too heavily).
Below, I’ll discuss some interesting and specific performance-related tidbits I came away with, as well as the reasons I eventually decided to stop optimizing. You can see my solutions here.
Favorite puzzle: Day 24. This was a really cool puzzle that used a series of obfuscated commands to simulate a stack. A very nice aha moment.
Assorted Performance Insights
Use a Better Algorithm First
As has been echoed repeatedly elsewhere, the best starting point is the overall algorithm. There are many cases where whole classes of problems can be obviated just by doing something a better way. Doing less work from the outset makes everything a lot better.
To give an example, on Day 18 I initially rushed in and implemented a very awkward tree representation, using an Interior Mutability pattern (many reference-counting pointers, i.e. Rc<RefCell<T>>) and node parent/child references.
- A quick search shows 28 Rc::clone() calls, ouch.
Replacing this with a straightforward array-backed Binary Tree implementation immediately resulted in a >10x gain.
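For a sense of what the replacement looks like, an array-backed binary tree is just index arithmetic over a flat Vec. A minimal sketch (the Day 18-specific node payload and the fixed depth here are illustrative, not my exact code):

```rust
// Array-backed binary tree sketch: node i's children live at 2*i + 1 and
// 2*i + 2, so traversal needs no pointers and no Rc<RefCell<T>>.
struct ArrayTree {
    nodes: Vec<Option<u32>>, // None = empty slot; payload type is illustrative
}

impl ArrayTree {
    fn with_depth(depth: u32) -> Self {
        ArrayTree {
            nodes: vec![None; (1usize << (depth + 1)) - 1],
        }
    }

    fn left(i: usize) -> usize { 2 * i + 1 }
    fn right(i: usize) -> usize { 2 * i + 2 }
    fn parent(i: usize) -> usize { (i - 1) / 2 }
}
```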
Intern/Avoid Strings When Possible
I had an aversion to strings based on past experience and gut instinct, so it was nice to see some confirmation from other sources, e.g. Why the Sorbet typechecker is fast.
Basically, we want to perform as few operations as possible on strings themselves, and replace them with something else wherever we can via interning.
String Interning
I first came across the term for this concept in Why the Sorbet typechecker is fast. Essentially, we create a consistent mapping from known string values to something else, like unsigned 8-bit integers, which are much better for performance in every way.
In Rust, a simple string interner, along the lines of what I used (with some tweaks), looks roughly like this:
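```rust
use std::collections::HashMap;

// Illustrative sketch of the usual map-plus-vec interner; field and method
// names are placeholders rather than the exact code from my solutions.
#[derive(Default)]
struct Interner {
    map: HashMap<String, u32>,
    vec: Vec<String>,
}

impl Interner {
    /// Returns a stable u32 id for `name`, interning it on first sight.
    fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.map.get(name) {
            return id;
        }
        let id = self.vec.len() as u32;
        self.map.insert(name.to_owned(), id);
        self.vec.push(name.to_owned());
        id
    }

    /// Maps an id back to the original string.
    fn lookup(&self, id: u32) -> &str {
        &self.vec[id as usize]
    }
}
```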
In my case I was able to drop the vec and convert the HashMap to a HashSet.
Reducing string operations
This is generally pretty mechanically trivial but can be surprisingly effective if you decide to pay attention. I’ll show three approaches in the context of a trivial solution (Day 2 Part 2) below:
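The first pass was roughly along these lines (an illustrative sketch, not the exact code behind the benchmark below):

```rust
// Day 2 Part 2, string-heavy version: collect each line into a Vec of &str,
// compare whole command strings, and parse the amount with str::parse.
fn solve_naive(input: &str) -> i64 {
    let (mut horizontal, mut depth, mut aim) = (0i64, 0i64, 0i64);
    for line in input.lines() {
        let parts: Vec<&str> = line.split(' ').collect();
        let amount: i64 = parts[1].parse().unwrap();
        match parts[0] {
            "forward" => {
                horizontal += amount;
                depth += aim * amount;
            }
            "down" => aim += amount,
            "up" => aim -= amount,
            _ => unreachable!(),
        }
    }
    horizontal * depth
}
```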
Benchmark time: 212.16 µs
A less redundant approach:
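Again as a sketch of the idea (not the exact benchmarked code): skip the intermediate Vec and only look at the first character of each command:

```rust
// Day 2 Part 2, trimmed down: no per-line Vec allocation, and the command is
// distinguished by its first character instead of a full string comparison.
fn solve_less_redundant(input: &str) -> i64 {
    let (mut horizontal, mut depth, mut aim) = (0i64, 0i64, 0i64);
    for line in input.lines() {
        let mut parts = line.split(' ');
        let cmd = parts.next().unwrap();
        let amount: i64 = parts.next().unwrap().parse().unwrap();
        match cmd.chars().next().unwrap() {
            'f' => {
                horizontal += amount;
                depth += aim * amount;
            }
            'd' => aim += amount,
            'u' => aim -= amount,
            _ => unreachable!(),
        }
    }
    horizontal * depth
}
```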
Benchmark time: 47.933 µs, a ~78.5% improvement!
An even more stripped-down approach:
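A byte-level sketch of the same idea (illustrative; it assumes single-digit amounts, which holds for the actual input):

```rust
// Day 2 Part 2, byte-oriented: dispatch on the first byte of each line and
// read the amount as (last byte - 48), i.e. an ASCII digit minus '0'.
fn solve_bytes(input: &str) -> i64 {
    let (mut horizontal, mut depth, mut aim) = (0i64, 0i64, 0i64);
    for line in input.lines() {
        // bytes() instead of chars(): no UTF-8 decoding, just raw u8s.
        let amount = (line.bytes().last().unwrap() - 48) as i64;
        match line.bytes().next().unwrap() {
            b'f' => {
                horizontal += amount;
                depth += aim * amount;
            }
            b'd' => aim += amount,
            b'u' => aim -= amount,
            _ => unreachable!(),
        }
    }
    horizontal * depth
}
```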
Benchmark: 34.101 µs, a ~26.1% improvement on our second approach, 6x faster than the initial approach.
bytes() is an easy drop-in replacement for chars() for a small performance gain; 48 is the ASCII value of ‘0’.
Caches Dominate
Caches are extremely important to performance. To quote from the sacred texts:
Modern hardware is incredibly sensitive to hit rates on CPU caches, and so our core data structures were designed… to maximize cache locality.
Caches are generally very small (e.g. the Intel i7-4770’s L1 instruction cache is 32 KB), so in cases where we’re dealing with many references/values, reducing size matters, e.g. u64 -> u8.
CPU caches also operate on cache lines, typically 64-byte chunks of memory around the accessed address, as an optimization to reduce total round-trips to main memory.
An Example
A common approach to enhance cache locality and memory access patterns is flattening arrays.
In the context of Rust, a Vec consists of a (pointer, capacity, length) triplet. This means that if we operate on a 2D Vec (Vec<Vec<T>>), each inner Vec necessitates a pointer-follow to memory that is relatively less likely to be contiguous.
Thus, it’s generally better to use a one-dimensional Vec (or a 2D array) and handle the access logic yourself, necessitating only one pointer-follow.
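A minimal sketch of the flattened layout (type and method names are illustrative):

```rust
// Flattened 2D grid: one contiguous allocation, one pointer-follow, and the
// (x, y) -> index mapping handled manually as y * width + x.
struct Grid<T> {
    width: usize,
    cells: Vec<T>,
}

impl<T: Copy> Grid<T> {
    fn get(&self, x: usize, y: usize) -> T {
        self.cells[y * self.width + x]
    }

    fn set(&mut self, x: usize, y: usize, value: T) {
        self.cells[y * self.width + x] = value;
    }
}
```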
So, for Advent of Code Day 15 Part 2, which involves a 250k-element 2D grid, shifting from a 2D Vec to a 1D Vec resulted in a >40% benchmarked improvement.
Related:
- Array2D provides a fixed sized two-dimensional array. It is more efficient…
- Grids in Rust, part 1: nested vs. flat Vecs
Flame Graphs are Awesome
Optimizing before profiling leads to wasted time. Performance bottlenecks can be very unintuitive.
Using [cargo-]flamegraph to generate flame graphs with dtrace was massively useful. There were so many cases where my first instinct was to perform a holistic analysis beginning at the algorithmic level, when execution time was essentially dominated by one operation in a hot code path.
Flame Graph Example:
Day 15 Part 2: a simple Dijkstra’s solution over a 2D grid (represented by a 1D Vec).
Can only move up, down, left, right. Start: top-left, Goal: bottom-right:
This benchmarks at: ~45ms
Just looking at this it’s not immediately clear where the lowest-hanging fruit is, although there are some suspicious bits.
Let’s generate a flame graph.
Drilling down into the flame graph shows that there are 3 offenders:
- alloc::collections::binary_heap::BinaryHeap<T>::pop (21.28%)
- alloc::raw_vec::RawVec<T,A>::reserve_for_push (14.89%)
- libsystem_malloc.dylib`free (10.64%)
Let’s leave the Binary Heap pop() alone for the moment, since it’s a bit more tedious to work around. We can see that allocating memory for a Vec and calling free are, combined, taking up more time than anything else.
Here’s our culprit:
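In spirit, it looked roughly like this (an illustrative sketch; the real version was a closure capturing the grid dimensions):

```rust
// Rough shape of the offender: it builds a brand-new Vec of neighbour indices
// on every call, i.e. a heap allocation per visited node in the hottest loop.
fn gen_adj_list(idx: usize, width: usize, height: usize) -> Vec<usize> {
    let (x, y) = (idx % width, idx / width);
    let mut adj = Vec::new();
    if x > 0 { adj.push(idx - 1); }
    if x + 1 < width { adj.push(idx + 1); }
    if y > 0 { adj.push(idx - width); }
    if y + 1 < height { adj.push(idx + width); }
    adj
}
```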
The gen_adj_list() closure is initializing and pushing up to four elements to a Vec for every single iteration of the loop that is our hottest code path.
Since we know that there can only be up to 4 adjacent elements, we can refactor the closure to re-populate the same array every loop iteration.
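A sketch of that refactor: write into a reusable four-element buffer and return how many slots are valid, so the hot loop never touches the allocator:

```rust
// Instead of returning a fresh Vec, fill a caller-provided [usize; 4] and
// report how many entries were written; callers iterate over adj[..n].
fn fill_adj(idx: usize, width: usize, height: usize, adj: &mut [usize; 4]) -> usize {
    let (x, y) = (idx % width, idx / width);
    let mut n = 0;
    if x > 0 { adj[n] = idx - 1; n += 1; }
    if x + 1 < width { adj[n] = idx + 1; n += 1; }
    if y > 0 { adj[n] = idx - width; n += 1; }
    if y + 1 < height { adj[n] = idx + width; n += 1; }
    n
}
```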
New Benchmark: ~22 ms (> 2x faster)
One caveat: profilers like dtrace, which are based on sampling applications at fixed rates (often every few ms), become difficult to utilize for long-tail optimization.
Other examples where flame graphs helped were exposing HashMap operations that were dominating execution time, prompting me to use my single external crate, fnv (Fowler–Noll–Vo hash function; more efficient for smaller hash keys).
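For reference, swapping in fnv is close to a drop-in change (the key/value types and function here are illustrative):

```rust
// FnvHashMap is std's HashMap with the FNV hasher plugged in, so the API is
// unchanged; construct it with Default instead of HashMap::new().
use fnv::FnvHashMap;

fn count_occurrences(values: &[u32]) -> FnvHashMap<u32, u32> {
    let mut counts: FnvHashMap<u32, u32> = FnvHashMap::default();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
}
```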
Iterators are fun
This is not exactly Rust-specific, but iterators in general are very fun to compose; check out this pretty arbitrary sum based on adjacent three-element sliding windows:
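Something along these lines (an illustrative reconstruction rather than the exact snippet):

```rust
// Sum each three-element sliding window, then count how many window sums are
// larger than the previous one.
fn count_increasing_window_sums(depths: &[u32]) -> usize {
    depths
        .windows(3)
        .map(|w| w.iter().sum::<u32>())
        .collect::<Vec<_>>()
        .windows(2)
        .filter(|pair| pair[1] > pair[0])
        .count()
}
```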
They’re incidentally also a nice thing to keep in mind when working to reduce allocations (do you really need that .collect()?).
I played around with some really verbose compositions, but noticed that the performance benefits dropped off pretty fast, sadly.
Benchmarking is Good
It’s kind of embarrassing how long it took me to set up basic benchmarking for individual solutions and for the entire solution set. As measurements become more granular, isolating the signal from the noise is increasingly important. I chose to use criterion.
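For reference, a minimal criterion bench harness looks roughly like this (solve and the inline input are placeholders for a real solution and puzzle input):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Placeholder standing in for an actual solution function.
fn solve(input: &str) -> i64 {
    input.lines().count() as i64
}

fn bench_day02(c: &mut Criterion) {
    let input = "forward 5\ndown 5\nforward 8\nup 3";
    c.bench_function("day 02 part 2", |b| b.iter(|| solve(black_box(input))));
}

criterion_group!(benches, bench_day02);
criterion_main!(benches);
```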
Letting Go & Moving On
Up until now, I’ve focused on the technical aspects of performance work and the takeaways I had as someone who had never worked on anything like this before. However, this write-up wouldn’t be complete without my overall impression of, and reflection on, the process.
Ultimately, I did come to feel that, at a certain point, optimizing problems just for the sake of watching a number go down was a bit lackluster. Yes, you can start packing every conceivable thing into some kind of bit set, but at a certain point it feels very detached from any real-world impact. The rabbit hole goes on forever: for every great solution, there’s always another level.
At one point, I fell into a routine where, when I was bored, I would run my little ./exec_all.sh script (cargo bench takes longer due to warmup) and pick a top-3 runtime problem to go optimize. It’s intrinsically satisfying to watch numbers go down, sometimes by 10x, but it’s also hard to ignore the nagging feeling in the back of your head that there’s no real point to doing any of this. I could’ve built some really cool stuff in the time I spent doing this (though I’m glad I tried it out).
I’m a product-first programmer: I want to make things that deliver value to other people, not obsess over code as some kind of abstract intellectual exercise detached from value creation. I think that’s reflected in what I’ve worked on (college search, automated tennis feedback, startup, etc.).
So, I consciously restrained myself from optimizing Day 19 and decided to call it on Advent of Code 2021. It was a fun experience, but I don’t expect that I’ll be back.