Some More Heavy C Wizardry
2025-06-11
In the past I’ve talked about C’s janky lambdas, ie. local anonymous functions with a cutesy interface using macros. This is a continuation of that post, not because I want to talk about lambdas, but because we are doing things you’re not supposed to in C.
Some time ago I was watching Tsoding’s (I
highly recommend every vod) 1 stream and came across
the idea of raw dogging the program counter and implementing coroutines
by hand in C. Having been successfully nerd-sniped, I decided to see if
I could roll my own implementation just referencing what Alexey had
used: Musl’s implementation of setjmp
longjmp
(for x86_64).
In this blog post, I’d like for us to reconstruct my implementation from first principles;
Coroutines are subroutines’ more charismatic and sociable cousins; instead of arranging themselves in rigid hierarchies, they are all about hippie stuff like cooperation. In less allegorical terms, coroutines are a unit of computation much like a function, but which can be suspended and resumed, yielding control of the execution of the program to other cooperating coroutines. Resumption of control implies that the state of the coroutine prior to its suspension has a degree of permanence.
Generators can be thought of as being backed by a coroutine, for instance. They also look like “green threads”, wherein the program manages its own pooling of tasks to perform, without OS threads getting involved a priori.
We are interested in stackful coroutines, because they are more interesting, and because C++ can’t do them yet as far as I know. A coroutine system is stackful when voluntary suspension (yielding) of control can occur deep inside a coroutine’s call stack, as opposed to only on the frame of the coroutine itself. These are particularly interesting for complex generators, such as parsers which by necessity produce complex trees, which sometimes have to be checked for correctness: maintaining the state of the parser for it to backtrack on a particular parsing of a token stream is inherently useful.
I’ve reproduced the code from Musl here for our convenience:
2011-2012 Nicholas J. Kain, licensed under standard MIT license */
/* Copyright setjmp:
mov %rbx,(%rdi) /* rdi is jmp_buf, move registers onto it */
mov %rbp,8(%rdi)
mov %r12,16(%rdi)
mov %r13,24(%rdi)
mov %r14,32(%rdi)
mov %r15,40(%rdi)
lea 8(%rsp),%rdx /* this is our rsp WITHOUT current ret addr */
mov %rdx,48(%rdi)
mov (%rsp),%rdx /* save return addr ptr for new rip */
mov %rdx,56(%rdi)
xor %eax,%eax /* always return 0 */
ret
longjmp:
xor %eax,%eax
cmp $1,%esi /* CF = val ? 0 : 1 */
adc %esi,%eax /* eax = val + !val */
mov (%rdi),%rbx /* rdi is the jmp_buf, restore regs from it */
mov 8(%rdi),%rbp
mov 16(%rdi),%r12
mov 24(%rdi),%r13
mov 32(%rdi),%r14
mov 40(%rdi),%r15
mov 48(%rdi),%rsp
jmp *56(%rdi) /* goto saved address without altering rsp */
Let’s first look at setjmp
. If you’re not familiar with
assembly code, not only do I envy you, I will work towards worsening
your life by explaining it. To wit, %rdi
corresponds to the
register the first (in this case only) parameter the function receives
from the C call convention in x86_64. The setjmp
gets a
pointer to a data structure called the jump buffer. It is not a public
interface we get to poke at, but by inspection we can see what it
comprises: of RBX (the Roblox™©® register); RBP, the stack frame base
pointer register; R12–R15 which are other non-volatile registers; then
the RSP stack pointer registers (the lea
stuff just removes
the bottom of the stack which corresponds to the return address of that
setjmp
call, then the actual return address, and the RDX
register.
The key takeaway is we save a bunch of registers to memory, which we
will later restore in longjmp
by referring to those values.
The important bit is that we may restore the registers and jump to the
saved return address, largely making it look like we returned from the
original call to setjmp
but with EAX set to something other
than 0, and therefore returning an indicator that we actually jumped
there.
If you squint a bit you might feel like this offers some justification for perverting and mangling C, as well as the means to do so.
I haven’t tested it very extensively, but this header-only implementation seems to work-on-my-machine™:
#ifndef CORO2_H
#define CORO2_H
#include <stdbool.h>
void coro_yield();
void coro_spawn(void (*fn)(void*), void*);
bool coro_poll();
#endif
#ifdef CORO2_IMPLEMENTATION
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#define auto_t __auto_type
#define typeof __Typeof
#define apush(vec, val) ({\
auto_t x = &(val);\
auto_t v = &(vec);\
if (v->count == v->capacity) {\
v->capacity = v->capacity? 2 * v->capacity: 1;\
v->items = realloc(v->items, sizeof(*x) * v->capacity);\
if (v->items == NULL)\
abort();\
}\
v->items[v->count++] = *x;\
})
#ifndef CORO_STACK_SIZE
#define CORO_STACK_SIZE (4 << 20 /* 4MB */)
#endif
struct coroutine {
void *rsp, *base;
bool dying;
};
struct _global_state {
struct coroutine* items;
size_t capacity, count, dying, current;
};
static struct _global_state _ = {0};
static struct coroutine* current(void);
static void restore(void*);
static void reap(void);
static void die(void);
static void save_rsp(void* rsp);
static void jmp2next(void);
((constructor)) static void coro_init() {
__attribute__(_, (struct coroutine){0});
apush}
((destructor)) static void coro_fini() {
__attribute__(_.items);
free= (struct _global_state) {0};
_ }
((naked)) void coro_yield() {
__attribute__("push %rdi");
__asm__("push %rbp");
__asm__("push %rbx");
__asm__("push %r12");
__asm__("push %r13");
__asm__("push %r14");
__asm__("push %r15");
__asm__("mov %rsp, %rdi");
__asm__("jmp *%0"::"r" (save_rsp));
__asm__}
void coro_spawn(void (*fn)(void*), void* arg) {
static int prot = PROT_READ | PROT_WRITE;
static int flag = MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN;
void* base = mmap(NULL, CORO_STACK_SIZE, prot, flag, -1, 0);
void** rsp = base;
*(--rsp) = die;
*(--rsp) = fn;
*(--rsp) = arg; // %rdi
*(--rsp) = 0; // %rbp");
*(--rsp) = 0; // %rbx");
*(--rsp) = 0; // %r12");
*(--rsp) = 0; // %r13");
*(--rsp) = 0; // %r14");
*(--rsp) = 0; // %r15");
(_, ((struct coroutine){.rsp = rsp, .base = base}));
apush}
bool coro_poll() {
return _.count > 1;
}
void die() {
if (_.count == 0)
();
abort()->dying = true;
currentstruct coroutine tmp = *current();
*current() = _.items[_.count - 1];
.items[_.count - 1] = tmp;
_
.count -= 1;
_if (_.count) {
.dying += 1;
_.current += 1;
_.current %= _.count;
_(current()->rsp);
restore} else {
// munmap'ing our own stack sounds like an awful idea
// just exit.
(EXIT_SUCCESS);
exit}
}
((naked)) void restore(void* rsp) {
__attribute__(""::"m" (rsp));
__asm__ __volatile__("mov %rdi, %rsp"); (void) rsp; // probably illegal :)
__asm__ // we can now reap the dying stacks, as our stack isn't dying
();
reap("pop %r15");
__asm__("pop %r14");
__asm__("pop %r13");
__asm__("pop %r12");
__asm__("pop %rbx");
__asm__("pop %rbp");
__asm__("pop %rdi");
__asm__("ret");
__asm__}
void reap() {
for (size_t i = _.count; i < _.count + _.dying; i++) {
(_.items[i].base, CORO_STACK_SIZE);
munmap.items[i] = (struct coroutine) {0};
_}
.dying = 0;
_}
struct coroutine* current() {
if (_.count == 0)
();
abortreturn &_.items[_.current];
}
void save_rsp(void* rsp) {
if (_.count == 0)
();
abort()->rsp = rsp;
current();
jmp2next}
void jmp2next(void) {
if (_.count == 0)
();
abort.current += 1;
_.current %= _.count;
_(current()->rsp);
restore}
#undef CORO2_IMPLEMENTATION
#endif
I haven’t really kept up with how Alexey’s changed his
implementation. I remember suggesting to use mmap
on the
comment section but I don’t know if he ever read it, probably not.
You may find the source and an example on these links. You need to compile with something like
gcc -I. -O0 -ggdb -std=c17 -D_DEFAULT_SOURCE coro.c
The C code itself it rather straight forward. What is a bit esoteric
is the swathes of inline assembly, and the fact that depending on very
mild changes, everything breaks. I basically got it to work
once and everything after that was incremental and hard-fought.
A past implementation I attempted to do from scratch for this blog post
simply didn’t work and I don’t know exactly what I was doing differently
from this implementation; changing things a bit made gcc
pretty upset, as it reprimanded me:
note: the value of the stack pointer after an ‘asm’ statement must be the same as it was before the statement
which is at the same time VERY reasonable, and quite unfortunate. Since this isn’t a production system, we’ll ignore the fact that the compiler might very well optimize our code into nasal demons etc.
First thing to do with any library is tracing back its public
interfaces to their implementations. Here we have three functions:
coro_poll()
is called to see if there are other coroutines
that could be resumed, or if we are all alone. This is important because
we don’t have a runtime to manage those coroutines for us, the way I’ve
set it up uses main()
to poll the coroutines, as it is the
only coroutine that surely will not exit (as if it does, the program
still terminates).
The concept of polling is perhaps as bit useless, because the responsibility of yielding execution control to peer routines is distributed, it just so happens main has to transfer said control, which puts it on the list of things that may be resumed, and therefore you kind of always have to coro_yield() so as to not squat the program counter.
The next public function in the scale of complexity is
coro_spawn
which works like a deferred call to a function.
We set up its stack frame as if it had been called but we just never
transfer control to it. With one key difference, the stack frame we
create is has just been allocated on the heap using mmap
,
creating a virtual mapping that grows downwards as needed until the
virtual memory manager gets fed up (when you encroach on an already
mapped page, for instance) and segfaults you.
This means that when we transfer the control to that function, it
will be as if it is our main
function, at the bottom of its
own separate and independent stack. With the same caveat that
main
has: it isn’t the true bottom of the stack,
main
is called by _start
which does some C
runtime stuff before calling main
, and then performs the
appropriate rites so that the program exits gracefully.
We have the same kind of thing happening, at the bottom of its stack,
a coroutine will have a guard die
function where its return
address ought to be. When/if the routine returns, die
will
not be so much invoked but rather returned to (even though we
never called the routine from die
) and it will handle
coroutine cleanup and switching to a different one if appropriate.
That is the essence of
*(--rsp) = die;
*(--rsp) = fn;
*(--rsp) = arg; // %rdi
*(--rsp) = 0; // %rbp");
*(--rsp) = 0; // %rbx");
*(--rsp) = 0; // %r12");
*(--rsp) = 0; // %r13");
*(--rsp) = 0; // %r14");
*(--rsp) = 0; // %r15");
But why pass all those registers through the stack? Well, it’s because in the process of switching between coroutine contexts, we save (what would normally be callee-saved) on the stack, then save the stack pointer itself on a global list of coroutine structures. When we restore/resume a particular coroutine, it takes from that stack and restores its own context before handing control back to the place that originally yielded it.
You should note that both in coro_yield
and
restore
we do not, in fact, store any information regarding
where to return to when we in fact restore the coroutine’s execution.
This is because the compiler already takes care of that: when we call
coro_yield, the compiler pushes the return address onto the stack, we
then do our register preservation thing we partially lifted from
setjmp
.
But we never return from coro_yield
, we jump
ie. transfer control to a function that saves the current stack pointer
in the current coroutine structure, advances the active/current
coroutine, then restores its context. Context restoration
involves undoing the register protection we performed when that
particular coroutine yielded, at which point the stack is “as if” we had
just just entered coro_yield
, and the very next instruction
is a ret
, meaning we do a “return” with exactly the same
registers (the ones that the calling convention says that matters, at
least) and stack we had just after calling coro_yield
.
This means that coro_yield
“looks like” it does nothing
at all, from the perspective of the function that yielded; it just went
to sleep for a while, and just like pausing your conscience for a
second, you can only observe it through the side-effects of things you
did not observe.
Some more nuanced things is how the dying coroutines mark themselves
as dying, and move themselves to “past” the end of the array of
coroutine objects so as to not be resumed, when we transfer control to
an active coroutine, in restore, we are entitled to reap the dying
coroutines, as our coroutine isn’t dying so we won’t run the
risk of munmap
ping our own stack from under our feet.
Well, frankly, all of them. This seems to be plainly illegal, if GCC
is anything to go by. It works but debugging is essentially impossible
aside from actually stepping through the program on gdb
and
trying to find the exact moment you get an error; if you miss it, you’ll
switch stacks and backtracing will literally not work right.
gdb
simply does not expect you to do something this
cursed.
Similarly, valgrind
gets very upset and crashes outright
because of some hi-lo discrepancy that I don’t presume to understand
ever.