coro_yield
2025-12-02
As regular readers (nullptr) may know, I am an avid azozing enjoyer.
Having watched Alexey’s recent video, I
came to be quite dissatisfied with the coroutine
implementation I had previously blogged about. I had mentioned in
the past that assert shits the bed when called from within
a coroutine context (actually I don’t think I did lol, but it does. It
also upsets valgrind so much it dies of pure disgust).
Alexey also had a similar problem.
The problem is: we save rdi, rbp, rbx, r12-15 because
those are non-volatile (callee-saved) registers and we are transferring
control into some foreign part of the binary “as if” we had returned
from the yield call normally. The central issue seems to be that
assert actually cares about stack alignment, and those
registers are seven in number… which means we are 8 bytes off
from the 16-byte alignment mandated by the amd64 calling convention.
Yeah…
So this issue has been fixed, as suggested by a chatter in his live
stream who goes by codecat69 (which is what happens when
you have three LaTeX active characters in a row, I think).
The more relevant improvement that isn’t just lifted from a more talented programmer is our new interface. Prior art was the following:
void coro_yield();
void coro_spawn(void (*fn)(void*), void*);
bool coro_poll();
meaning you had no control over which coroutine you yield to, or over when and if its data is released; it just handled the lifetimes automatically at the cost of that flexibility. Then I realized I was programming in C, where lifetime concerns are a matter of 𝓼𝓴𝓲𝓵𝓵 𝓲𝓼𝓼𝓾𝓮 — provided the API is good enough. And if you think an API is bad, then that is also a 𝖘𝖐𝖎𝖑𝖑 𝖎𝖘𝖘𝖚𝖊.
Our current API is as follows:
typedef struct coroutine coroutine_t;
coroutine_t* coro_spawn (void (*fn)(void*), void*);
void coro_yield (coroutine_t*);
void coro_release (coroutine_t*);
bool coro_finished(coroutine_t const*);
bool coro_active  (coroutine_t const*);
It is (supposedly, I presume and hope) also easy to make it thread safe (I haven’t, because I fear my creation more than my creator at this point) by slapping some mutexes on it or whatever.
The general idea hinges on the following definitions:
struct coroutine {
    void*  rsp;
    void*  base;
    size_t size;
    bool   active;
    bool   finished;
};
static thread_local void* thread_rsp = NULL;
static thread_local coroutine_t* current_coroutine = NULL;
Like before, rsp tracks the current stack position at
the time of either creation or, more importantly, the last yield of
control. base (and size… thank you, Linus,
very cool) allows us to unmap the whole stack of the coroutine.
active and finished are state flags, and could be rolled into
an enum if I had been well-rested when I banged this one
out (I’ll do it later, ofc).
Being active or finished prevents a
coroutine from being yielded to; conversely, you cannot be in an
execution context while finished, and you cannot yield
from a coroutine that is not active. The thread_rsp,
however, becomes important later.
The coro_spawn hasn’t changed much:
coroutine_t* coro_spawn(void (*fn)(void*), void* dat) {
    static const int    prot = PROT_READ | PROT_WRITE;
    static const int    flag = MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN;
    static const size_t size = 1 << 20; // You could, of course, change this
    void* base = mmap(NULL, size, prot, flag, -1, 0);
    void** rsp = base;
    *(--rsp) = (void*)_coro_die;
    *(--rsp) = (void*)fn;
    *(--rsp) = dat; // %rdi
    *(--rsp) = 0;   // %rbp
    *(--rsp) = 0;   // %rbx
    *(--rsp) = 0;   // %r12
    *(--rsp) = 0;   // %r13
    *(--rsp) = 0;   // %r14
    *(--rsp) = 0;   // %r15
    *(--rsp) = 0;   // $pad
    coroutine_t* ptr = malloc(sizeof(*ptr));
    *ptr = (coroutine_t) {.rsp = rsp, .base = base, .size = size}; // rest zeroed
    return ptr;
}
but for the sake of completeness, I’ll walk you through it: we map a
region of memory that grows downwards as we page-fault, here
mmap actually returns a pointer to the byte past the last
byte of the first (ie. highest) mapped page. We then populate the stack
in accordance with the application binary interface (ABI): first we
write the final return address (_coro_die), then we push
another return address (fn) and store dummy values for all
callee-saved registers. The only stack value that is not a dummy is
dat, which corresponds to rdi — which, when
popped, will assign the value dat to the first argument of
fn according to the calling convention.
The _coro_die function changed a bunch from the previous
iteration, as we were doing funky round-robin and automatically reaping
of coroutines at that point in time. Instead, we do the following
(pretty simple) cleanup:
__attribute__((naked))
void _coro_die() { // entered via ret, with no caller to return to
    static void* discard[16] = {0};
    coroutine_t* coro = current_coroutine;
    assert(coro);
    assert(coro->active);
    assert(coro->finished == false);
    coro->active = false;
    coro->finished = true;
    current_coroutine = NULL;
    _coro_return_into(thread_rsp, &discard[8]);
}
Since when fn actually returns, _coro_die
will be “returned into” (ie. the program counter will actually jump to
the beginning of the function) we expect the currently active coroutine
to be the expiring, returned-from function. We assert it is active and
not yet finished, flip those flags, unset
current_coroutine, and perform the “return-into” step with a
dummy target for the stack-pointer store. The value of 16 is quite arbitrary.
The coro_yield function, now taking a coroutine handle,
is an honest to God C function with a single funky call to
_coro_return_into:
void coro_yield(coroutine_t* coro) {
    void** checkpoint;
    void*  rsp;
    if (coro) {
        assert(coro->active == false);
        assert(coro->finished == false);
        rsp = coro->rsp;
        if (current_coroutine) {
            assert(current_coroutine->active);
            current_coroutine->active = false;
            checkpoint = &current_coroutine->rsp;
        } else {
            checkpoint = &thread_rsp;
        }
        current_coroutine = coro;
        current_coroutine->active = true;
    } else if (current_coroutine) {
        assert(current_coroutine->active);
        rsp = thread_rsp;
        checkpoint = &current_coroutine->rsp;
        current_coroutine->active = false;
        current_coroutine = NULL;
    } else {
        return;
    }
    _coro_return_into(rsp, checkpoint);
}
it simply determines which rsp to restore so the return
into function trick works, and where to store the current stack pointer
into prior to yielding execution. The cases reflect the following
scenarios: we are yielding into an actual coroutine handle, which must
be inactive and unfinished; if we are also in a coroutine execution
context, we deactivate the yielding coroutine and use its rsp
field as the checkpoint; if we do not have an active coroutine, we must
be in the main execution context, so we checkpoint into
thread_rsp instead.
If, however, we are not yielding into an actual coroutine handle, it must mean we are simply yielding to whomever. The flexibility of the design is that we then return to the main execution context of the given thread; in particular, every thread gets to schedule coroutine execution as it sees fit, all without breaking the expectation that deliberately yielding into a coroutine transfers control to that coroutine. Likewise, a coroutine that spawns other coroutines can itself be responsible for scheduling them, without the rest of the program being aware of those details.
This leaves the elusive _coro_return_into function to be
defined:
__attribute__((naked))
void _coro_return_into(void* rsp, void** checkpoint) {
    __asm__ volatile(
        "push %rdi\n\t"         // rdi: on a fresh coroutine stack this slot carries dat
        "push %rbp\n\t"         // callee-saved registers
        "push %rbx\n\t"
        "push %r12\n\t"
        "push %r13\n\t"
        "push %r14\n\t"
        "push %r15\n\t"
        "push $777\n\t"         // pad: keeps the 16-byte alignment honest
        "mov  %rsp, (%rsi)\n\t" // *checkpoint = current stack pointer
        "mov  %rdi, %rsp\n\t"   // switch to the stack we are returning into
        "pop  %r15\n\t"         // discard the pad
        "pop  %r15\n\t"         // restore everything in reverse order
        "pop  %r14\n\t"
        "pop  %r13\n\t"
        "pop  %r12\n\t"
        "pop  %rbx\n\t"
        "pop  %rbp\n\t"
        "pop  %rdi\n\t"
        "ret");                 // jump to whatever return address tops the new stack
}
it is the only function with inline assembly, and the source of all
undefined behavior and unicorn vomit in the library. The contract is
simple: we push the callee-saved registers onto the stack, with the
appropriate padding; we store the resulting stack pointer into the
region pointed to by rsi, which is the second pointer
argument checkpoint; we then set the stack pointer register
rsp to the rsp parameter of the function,
which was passed through the rdi register. We then pop the
registers, popping our padding into r15 prior to restoring
it to its original value, and then restoring the remaining
registers.
By performing ret, we force the function to pop from the
stack and jump into code at the popped instruction pointer. When we
first coro_yield into a coroutine, this value is
fn as set by coro_spawn, meaning we jump into
the beginning of that function.
When we coro_yield from a coroutine, we
performed a call instruction to _coro_return_into, meaning
the compiler was forced to push a return address, pointing just past
the call instruction, onto the stack; we then pushed our registers on
top of it. When we coro_yield back into that
coroutine, we pop what we pushed, so that return address,
just past the call _coro_return_into instruction, ends up
at the top of the stack. So by performing ret, we
resume execution exactly at that point, which is at the end of
coro_yield. Meaning we resume coroutine
execution “as if” coro_yield was a no-op (aside from the
side effects that are the whole point of coroutines).
Of course, the compiler is truly clueless about any of this stack yoga and program counter sodomy. So it’s anyone’s guess what will actually happen when you run this in production.
But the principle still applies: “It works on my machine.”
Finally, coro_release, coro_finished and
coro_active are pretty much obvious: you
munmap the base at your leisure.
Aside from the previously mentioned state flags being rolled into an
enumeration, the thread safety feature should (hopefully) not be a
| dream considering we can add some mutexes when altering
or checking the state of the coroutines, and that should (it’s 2am, I
won’t commit to a strong stance about the interactions of parallelism
with advanced stack buggery) take care of any pain that isn’t
self-inflicted.
This was a pain to get to work, but once it became clear that I could do the saving and restoring in a single function, the pieces fell into place quite quickly. Just another example of how the right partition of responsibilities is required to even start an admissible implementation. The biggest challenge in these sorts of dark and violent rituals is that even returning, or pushing return addresses, are themselves responsibilities at this level of programming. It is easy to overengineer a solution when, really, minimizing the points where you have to touch the registers is optimal.
As for the code, a lightly tested (ie. it stopped SEGFAULTing and behaved as expected in a single test compiled without optimizations :D) implementation is available under GPLv3 or later. Let me know if you need another license for whatever reason.