This note is:
Single-threaded software has it easy with regard to procedure call frame allocation. The widely followed convention is that a large fraction of a process's virtual address space is devoted to "the stack", which is where call frames are stored. A register contains a pointer to the deepest frame (i.e. the currently executing call). During calls and returns, allocating and freeing frame memory requires only adjusting that pointer.
Multithreaded software can't use this simple convention. Each thread needs to allocate call frames independently of other threads. The common solution to this dilemma is to allocate a relatively large block of contiguous memory at thread spawn time for that thread's frames. With such a block of memory, procedure calling can be carried out exactly as it is in single-threaded applications. When a processor switches from one thread to another is just has to switch the end-of-stack pointer.
There is a strong and unpleasant engineering tension around how big these blocks of memory should be.
There is a large body of research, especially in the embedded systems community, on analyses to find just the right stack size for each thread in a system. Most general-purpose multithreading libraries have a fairly large default size (2MB on some systems) and allow client code to set a different one.
It's worth observing that even if you get exactly the right stack size for a thread, there is still potentially a substantial amount of wasted memory. The depth of the stack varies as procedures are called and return. The stack memory needs to be sized for the very deepest call chain, but when the call chain is shallower, the remainder of the space reserved for the stack is wasted.
These memory concerns are less acute on 64-bit machines than 32-bit machines. The address space is so large that it's sometimes not problematic to reserve a large block of memory for every thread. The virtual memory system won't actually allocate physical pages until the stack grows enough to need them. There's still a bit of a minimum granularity issue: spawning a thread costs at least a page of memory (4kB in many systems). More on this later.
Rather than allocating call frames in a nice big contiguous block of memory, we could allocate each one as an independent object in the heap. This completely side-steps the problems described above. Unfortunately, it comes with a collection of serious drawbacks of its own.
The most fundamental problem with heap allocating call frames is that it makes calling (and returning) considerably more expensive. The simplest technique is to do a heap allocation for every call and a deallocation for every return. Of course it's possible to pull out the bag of memory allocation tricks, and some of these can marginally improve performance.
Some functional language implementations use heap allocation and are highly optimized for it. Even in these implementations, calls and returns are (very roughly speaking) twice as expensive as in C. Less optimized implementations are substantially worse.
[quick tangent] Why do some functional language implementations use this strategy if it comes with non-trivial overhead? It makes implementing first-class functions and first-class continuations a heck of a lot easier (compared to conventional contiguous stacks). [/quick tangent]
Of course applications do a lot more than just call and return. Nevertheless, this magnitude of overhead is problematic for many performance-sensitive applications.
There is an awful lot of code out there that expects call frames to be allocated using the contiguous stack strategy. This includes tools like compilers, linkers, debuggers and profilers, and libraries that we would like to be able to reuse as-is.
Split stacks are a compromise. On thread spawn a small contiguous block of memory called a stack segment is allocated. Most calls and returns execute as they would with a conventional stack. However, if a call requires more memory than is available in the current segment, the system allocates a new segment and links it to the previous one. When the corresponding return happens, the system frees the segment and follows the link back to the previous one.
At first blush, split stacks seem like a nice hybrid solution. They avoid the need to pre-allocate large blocks for each thread and they can accommodate arbitrarily deep call chains. In fact it's a sufficiently attractive idea that both gcc and llvm have had at least unofficial support for split/segmented stacks. Unfortunately, this strategy has some problematic weaknesses of its own.
With split stacks, all calls and returns are slightly more expensive than with contiguous stacks because it's necessary to check whether you're at the end/beginning of a segment. In the normal case you won't be, but the check itself has some cost.
There is an uncommon scenario where a thread makes a call that causes a new segment to be allocated, then very quickly returns from that call, then makes another call and returns, and so on. In this scenario we're paying the (high) cost of segment management on every call and return. This cost can be reduced somewhat with clever engineering, but it can't be eliminated. For many applications this kind of performance unpredictability is worse than moderate common-case overhead.
Split stacks are sufficiently attractive that at least a few recently designed languages have adopted them as the default call frame management strategy. Unfortunately, as the consequences of the problems described above became clear, at least a couple of these projects have abandoned split stacks in favor of a more conventional stack strategy. For example:Rust developers' mailing list
This whole issue is really only problematic for applications with lots of threads. There are plenty of smart people who argue that multithreading is more trouble than it's worth for most applications. So who cares? The reason why I care — and bothered writing this note — is that I'm interested in cooperative multithreading. It's at least sane to consider software architectures that involve lots of cooperative threads, because they are not nearly as prone to concurrency bugs as regular (preemptive) threads. And cooperative threads have exactly the same stack management headache as regular threads.
If one wants to get really exotic and consider software architectures with many thousands of threads, the minimum granularity issue comes to the fore. Even with split stacks, the minimum memory cost for a thread is usually one or a few memory pages. For very "small" threads this represents a large amount of wasted memory due to internal fragmentation. I can't think of any way around this other than heap allocation of call frames.
Here is my proposal for resolving the thread/stack tension. It's based on the observation that in most applications the call lifetime distribution is long-tailed. That is, most calls are very short-lived and few calls are long-lived. If you think about calls in terms of a (dynamic) call tree, there are a few long-lived calls near the root and many short-lived calls near the leaves. Furthermore, most (dynamic) calls happen at a relatively small number of (static) call sites.
The per-call overhead of the heap allocation strategy is only important for calls that happen very frequently and must, by definition, be short-lived. So what about allocating the long-lived calls individually on the heap and the short-lived calls using a conventional strategy in a (small-ish) contiguous block of memory? There's a lot left to work out in this idea. For example, how do we know which calls are going to be short-lived? Maybe it should be a just-in-time style optimization. Not sure.
If you happen to know of any previous work on ideas like this, I would love to hear about it.
Benjamin Ylvisaker, October 2014