When defeat is inevitable,
It is wisest to coro_yield

José Goudet Alvim

2025-12-02

Changelog

As regular readers (nullptr) may know, I am an avid azozing enjoyer. Having watched Alexey’s recent video, I came to be quite dissatisfied with the coroutine implementation I had previously blogged about. I had mentioned in the past that assert shits the bed when called from within a coroutine context (actually I don’t think I did, lol, but it does. It also upsets valgrind so much it dies of pure disgust). Alexey also had a similar problem.

The problem is: we save rdi, rbp, rbx, and r12–r15 (rbp, rbx, and r12–r15 because they are non-volatile, i.e. callee-saved, registers; rdi to carry the first argument) and we are transferring control into some foreign part of the binary “as if” we had returned from the yield call normally. The central issue is that assert actually cares about stack alignment, and those registers are seven in number… which means we are 8 bytes off from the 16-byte alignment the amd64 calling convention normally guarantees. Yeah…

So this issue has been fixed, as suggested by a chatter in his live stream who goes by codecat69, which is what happens when you have three LaTeX active characters in a row, I think.

The more relevant improvement that isn’t just lifted from a more talented programmer is our new interface. Prior art was the following:

void coro_yield();
void coro_spawn(void (*fn)(void*), void*);
bool coro_poll();

meaning you have no control over which coroutine you yield to, when and if its data is released, etc. Instead, it handles the lifetimes automatically at the cost of that extra flexibility. Then I realized I was programming in C, and lifetime concerns are a matter of 𝓼𝓴𝓲𝓵𝓵 𝓲𝓼𝓼𝓾𝓮 provided the API is good enough; if you think an API is bad, then that is also a 𝖘𝖐𝖎𝖑𝖑 𝖎𝖘𝖘𝖚𝖊.

Our current API is as follows:

typedef struct coroutine coroutine_t;
coroutine_t* coro_spawn   (void (*fn)(void*), void*);
void         coro_yield   (coroutine_t*);
void         coro_release (coroutine_t*);
bool         coro_finished(coroutine_t const*);
bool         coro_active  (coroutine_t const*);

It is (supposedly, I presume and hope) also easy to make it thread safe (I haven’t because I fear my creation more than my creator at this point) by slapping some mutexes or whatever.
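To show how the handles are meant to flow, here is a hypothetical usage sketch. The worker function, the rounds counter, and the driving loop are all made up, and this fragment is not standalone runnable since it needs the library itself:

```c
// Hypothetical usage of the API above.
void worker(void* arg) {
	int* rounds = arg;
	for (int i = 0; i < *rounds; i++)
		coro_yield(NULL); // yield back to whoever scheduled us
}

int main(void) {
	int rounds = 3;
	coroutine_t* c = coro_spawn(worker, &rounds);
	while (!coro_finished(c))
		coro_yield(c);    // drive the coroutine until it returns
	coro_release(c);          // only now is it safe to unmap its stack
	return 0;
}
```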

Undefined Behavior Any%

The general idea hinges on the following definitions:

struct coroutine {
	void* rsp;
	void* base;
	size_t size;
	bool active;
	bool finished;
};

static thread_local void*               thread_rsp = NULL;
static thread_local coroutine_t* current_coroutine = NULL;

Like before, rsp tracks the current stack position at the time of either creation or, more importantly, the last control yield. base (and size… thank you, Linus, very cool) allows us to unmap the whole stack for the coroutine. active and finished are state flags and could be rolled into an enum if I had been well-rested when I banged this one out (I’ll do it later, ofc).

Being active or finished prevents a coroutine from being yielded to; conversely, you cannot be in an execution context while finished, and you cannot yield from a coroutine that is not active. The thread_rsp becomes important later.

The coro_spawn hasn’t changed much:

coroutine_t* coro_spawn(void(*fn)(void*), void* dat) {
	static const int    prot = PROT_READ   | PROT_WRITE;
	static const int    flag = MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN;
	static const size_t size = 1 << 20; // You could, of course, change this

	void* base = mmap(NULL, size, prot, flag, -1, 0);
	assert(base != MAP_FAILED);

	void** rsp = base;
	*(--rsp) = (void*)_coro_die; // return address when fn returns
	*(--rsp) = (void*)fn;        // return address for the first yield
	*(--rsp) = dat; // %rdi
	*(--rsp) = 0;   // %rbp
	*(--rsp) = 0;   // %rbx
	*(--rsp) = 0;   // %r12
	*(--rsp) = 0;   // %r13
	*(--rsp) = 0;   // %r14
	*(--rsp) = 0;   // %r15
	*(--rsp) = 0;   // $pad

	coroutine_t *ptr = malloc(sizeof(*ptr));
	*ptr = (coroutine_t) {.rsp = rsp, .base = base, .size = size};
	return ptr;
}

but for the sake of completeness, I’ll walk you through it: we map a region of memory that grows downwards as we page-fault; here mmap actually returns a pointer to the byte past the last byte of the first (i.e. highest) mapped page. We then populate the stack in accordance with the application binary interface (ABI): we write the final return address (_coro_die), then push another return address (fn), and store dummy values for all callee-saved registers. The only stack value that is not a dummy is dat, which corresponds to rdi; when popped, it becomes the first argument of fn according to the calling convention.

The _coro_die function changed a bunch from the previous iteration, since back then we were doing funky round-robin scheduling and automatic reaping of coroutines. Instead, we now do the following (pretty simple) cleanup:

__attribute__((naked))
void _coro_die() {
	static void* discard[16] = {0};

	coroutine_t* coro = current_coroutine;	
	assert(coro);
	assert(coro->active);
	assert(coro->finished == false);
	coro->active = false;
	coro->finished = true;
	current_coroutine = NULL;
	_coro_return_into(thread_rsp, &discard[8]);
}

Since, when fn actually returns, _coro_die will be “returned into” (i.e. the program counter will jump to the beginning of the function), we expect the currently active coroutine to be the one that just expired. We assert that it is active and not yet finished, flip those flags, unset current_coroutine, and perform the “return-into” step with a dummy store target for the saved stack pointer. The value of 16 is quite arbitrary.

The coro_yield function, now taking a coroutine handle, is an honest-to-God C function with a single funky call to _coro_return_into:

void coro_yield(coroutine_t* coro) {
	void **checkpoint;
	void  *rsp;
	if (coro) {
		assert(coro->active == false);
		assert(coro->finished == false);
		rsp = coro->rsp;
		if (current_coroutine) {
			assert(current_coroutine->active);
			current_coroutine->active = false;
			checkpoint = &current_coroutine->rsp;
		} else {
			checkpoint = &thread_rsp;
		}
		current_coroutine = coro;
		current_coroutine->active = true;
	} else if (current_coroutine) {
		assert(current_coroutine->active);
		rsp = thread_rsp; 
		checkpoint = &current_coroutine->rsp;
		current_coroutine->active = false;
		current_coroutine = NULL;
	} else {
		return;
	}

	_coro_return_into(rsp, checkpoint);
}

it simply determines which rsp to restore so that the return-into trick works, and where to store the current stack pointer prior to yielding execution. The cases reflect the following scenarios: if we are yielding into an actual coroutine handle, that handle must be inactive and unfinished; if we are currently inside a coroutine execution context, we deactivate the yielding coroutine and save into its rsp; if no coroutine is active, we must be in the main execution context, so we save into thread_rsp instead.

If, however, we are not yielding into an actual coroutine handle, it must mean we are simply yielding to whomever; the flexibility of the design is that we return to the main execution context of the current thread. In particular, every thread gets to schedule coroutine execution as it sees fit, all without breaking the expected behaviour that deliberately yielding into a coroutine transfers control to that coroutine.

This means every coroutine that spawns other coroutines can itself be responsible for scheduling them, without the rest of the program being aware of those details.

This leaves the elusive _coro_return_into function to be defined:

__attribute__((naked))
void _coro_return_into(void* rsp, void** checkpoint) {
	(void) rsp; (void) checkpoint;
	__asm__("push %rdi");         // caller's first argument
	__asm__("push %rbp");
	__asm__("push %rbx");
	__asm__("push %r12");
	__asm__("push %r13");
	__asm__("push %r14");
	__asm__("push %r15");
	__asm__("push $777");         // padding: keeps %rsp 16-byte aligned
	__asm__("mov  %rsp, (%rsi)"); // *checkpoint = our stack pointer
	__asm__("mov  %rdi,  %rsp");  // switch to the target stack
	__asm__("pop  %r15");         // discard the padding slot
	__asm__("pop  %r15");
	__asm__("pop  %r14");
	__asm__("pop  %r13");
	__asm__("pop  %r12");
	__asm__("pop  %rbx");
	__asm__("pop  %rbp");
	__asm__("pop  %rdi");

	__asm__("ret");               // jump to the address on the restored stack
}

it is the only function with inline assembly, and the source of all undefined behavior and unicorn vomit in the library. The contract is simple: we push the callee-saved registers onto the stack, with the appropriate padding; we store the resulting stack pointer into the location pointed to by rsi, which is the second pointer argument, checkpoint; we then set the stack pointer register rsp to the rsp parameter of the function, which was passed in the rdi register. Finally, we pop the registers back, first popping our padding into r15 before restoring r15’s actual value, and then restoring the remaining registers.

By performing ret, we force the CPU to pop the top of the stack and jump to the popped instruction pointer. When we first coro_yield into a coroutine, this value is fn as set by coro_spawn, meaning we jump to the beginning of that function.

When we coro_yield from a coroutine, we performed a call instruction to _coro_return_into, meaning the compiler was forced to push a return address, just past the call instruction, onto the stack; we then pushed some stuff on top. When we coro_yield back into that coroutine, we pop the stuff we pushed, so that the return address just past the call _coro_return_into instruction ends up at the top of the stack. By performing ret, we resume execution exactly at that point, which is at the end of coro_yield. Meaning we resume coroutine execution “as if” coro_yield were a no-op (aside from the side-effects that are the whole point of coroutines).

Of course, the compiler is truly clueless about any of this stack yoga and program counter sodomy. So it’s anyone’s guess what will actually happen when you run this in production.

But the principle still applies:

It works on my machine

Finally, coro_release, coro_finished and coro_active are pretty much obvious: you munmap the base at your leisure.

Further Improvements

Aside from the previously mentioned state flags being rolled into an enumeration, the thread safety feature should (hopefully) not be a | dream, considering we can add some mutexes when altering or checking the state of the coroutines, and that should (it’s 2am, I won’t commit to a strong stance about the interactions of parallelism with advanced stack buggery) take care of any pain that isn’t self-inflicted.

Conclusion

This was a pain to get to work, but when it became clear that I could do the saving and restoring in a single function, the pieces started to fall into place quite quickly. Just another example of how the right partition of responsibilities is required to even start an admissible implementation. The biggest challenge in these sorts of dark and violent rituals is the fact that even returning, or pushing return addresses, are themselves responsibilities at this level of programming. It is easy to overengineer a solution when, really, minimizing the points where you have to touch the registers is optimal.

As for the code, a lightly tested (ie. it stopped SEGFAULTing and behaved as expected in a single test compiled without optimizations :D) implementation is available under GPLv3 or later. Let me know if you need another license for whatever reason.