Call/Return Microbenchmarking and Call Frame Allocation

This is a note on two semi-connected topics: call/return performance and call frame allocation strategies.

Call/Return Performance

The overhead of procedure calls is not the hottest topic in software performance engineering, but it has gotten some attention over the years. Questions related to direct versus indirect calls (virtual dispatch and/or higher-order functions) in particular are pretty interesting. Here I just want to suggest a simple microbenchmark for evaluating call/return overhead.

int f0( int x )
    /* The details of this calculation are not important */
    return ( x + 163 ) % 389;

int f1( int x )
    return f0( f0( x ) );

int f2( int x )
    return f1( f1( x ) );

/* ... */

int fN( int x )
    return fN1( fN1( x ) );

I call this the binary call tree benchmark. I'd be happy to hear if this is already in common use somewhere. The first interesting feature of this benchmark is that it has no loops or recursion, yet its run time is exponential in the length of the code. fN makes two calls to fN-1, each of which makes two calls to fN-2 and so on. So f0 will be called 2N times, f1 will be called 2(N - 1) times and so on.

The computation in f0 is pretty minimal. The intention is that it's just complicated enough that the compiler won't just compute the whole thing at compile time. So the majority of the time is spent in calling, returning and parameter passing. On a couple-year-old MacBook Air a C version of this benchmark works out to about 4.4 cycles per call. That seems pretty darn good. It's hard to imagine a general-purpose system doing much better than that.

I tested a small handful of other languages, of which I'll report a couple points of interest to me. The first is just for entertainment: Python is around 2 orders of magnitude slower than C at this microbenchmark. Everyone knows that Python's performance is generally pretty bad, but that seems almost comical to me. How can it take more than 400 cycles just to do a procedure call and return?

The second interesting case is Standard ML, which came in a bit under 2× slower than C. That's not great, but it's really not bad in light of what's coming in the second section of this note.

Usual Microbenchmarking Caveat

No application spends all its time just calling and returning. I would guess that call/return overhead is pretty far down the list of most language implementers' performance concerns. However, if you are thinking about monkeying around with procedure calls in some way, I submit that this benchmark is a decent sanity check on the overhead you're introducing.

Call Frame Allocation

My main motivation for writing this note is some musing about call frame allocation. In mainstream language implementations there is a strong concensus around stack-based call frame allocation. The frames in a chain of calls are allocated in one large continguous block of memory. When a call happens, the stack pointer is incremented by the correct amount to make space for the callee's new frame.

There are language implementations that use other call frame allocation strategies. In particular, one can allocate each frame as a separate object in the heap. Heap-allocating call frames is substantially more expensive than stack allocation (more on this later), but it does provide some nice flexibility. For example, supporting first-class continuations is a heck of a lot simpler if you heap-allocate call frames.

Another motivation for heap allocation is threads. Most implementations of threads (and cousins) allocate a large block of memory for each thread's call frames at thread spawn time. This is by far the simplest thing to do, if you insist on using a stack-allocation strategy for call frames.

Unfortunately, preallocating stack space for threads comes with a nasty engineering tradeoff.

If one happens to be be interested in programming styles that involve lots of threads, preallocating stack space is not so attractive. So how bad is the overhead for heap allocation of call frames? Let's look at two extremes.

I have implemented a relatively simple system for translating C code to a form that uses heap allocation for call frames. This system is similar to Gabriel Kerneis' Continuation-Passing C. My own initial experiments are roughly in line with Gabriel's published numbers on CPC: the binary call tree microbenchmark has more than 20× overhead relative to plain C. That's a pretty eye-popping number (though not as bad as Python!).

The other extreme is Standard ML, which I briefly referred to earlier. If my understanding of the SML/NJ compiler implementation is correct, they use heap allocation for stack frames. This is a reasonable choice, since SML/NJ supports first-class continuations. As noted above, SML/NJ's overhead on the benchmark is less than 2×. My guess is that the two main reasons for this big overhead gap are:

My hope is that with careful engineering it's possible to use heap allocation for more mainstream languages with overhead closer to SML/NJ's than CPC's.

Hybrid Heap/Stack Allocation

Here's an idea for bringing the overhead of heap allocation of call frames down. I am not aware of anyone having implemented this idea; I would love to hear about any such efforts. The idea is based on the observation that in most applications, most calls are short-lived and happen near the leaves of the (dynamic) call tree.

We could pre-allocate a small block of memory that is sized to hold just a few call frames. Relatively long-lived calls (generally nearer the root of the call tree) can be heap-allocated. The allocation overhead can be amortized over the long life of such calls. Short-lived calls can be stack-allocated in the small preallocated memory.

How do we distinguish between long-lived and short-lived calls? There could be some automatic analysis (static or dynamic). We can also imagine adding some programmer annotation for leaf/near-leaf procedures.

How much might such a strategy save us? Well, I tweaked my binary call tree benchmark to give us some idea. I arranged to use heap allocation for fN down to fK, then stack allocation for fK-1 down to f0, where K is a configurable parameter. Recall that with pure heap allocation the overhead compared to plain C is more than 20×. I found that with K=4 the overhead got down below 2×. This suggests that the overhead of heap allocation can be dramatically reduced by using stack allocation for a relatively thin slice of the near-leaf portion of the call tree.

Benjamin Ylvisaker, October 2014