Coroutines

In the past I’ve talked about C’s janky lambdas, ie. local anonymous functions with a cutesy interface using macros. This is a continuation of that post, not because I want to talk about lambdas, but because we are doing things you’re not supposed to in C.

Some time ago I was watching Tsoding’s (I highly recommend every vod) ¹ stream and came across the idea of raw dogging the program counter and implementing coroutines by hand in C. Having been successfully nerd-sniped, I decided to see if I could roll my own implementation just referencing what Alexey had used: Musl’s implementation of setjmp longjmp (for x86_64).

In this blog post, I’d like for us to reconstruct my implementation from first principles;

What are coroutines anyway?

Coroutines are subroutines’ more charismatic and sociable cousins; instead of arranging themselves in rigid hierarchies, they are all about hippie stuff like cooperation. In less allegorical terms, coroutines are a unit of computation much like a function, but which can be suspended and resumed, yielding control of the execution of the program to other cooperating coroutines. Resumption of control implies that the state of the coroutine prior to its suspension has a degree of permanence.

Generators can be thought of as being backed by a coroutine, for instance. They also look like “green threads”, wherein the program manages its own pooling of tasks to perform, without OS threads getting involved a priori.

We are interested in stackful coroutines, because they are more interesting, and because C++ can’t do them yet as far as I know. A coroutine system is stackful when voluntary suspension (yielding) of control can occur deep inside a coroutine’s call stack, as opposed to only on the frame of the coroutine itself. These are particularly interesting for complex generators, such as parsers which by necessity produce complex trees, which sometimes have to be checked for correctness: maintaining the state of the parser for it to backtrack on a particular parsing of a token stream is inherently useful.

Setjmp and Longjmp

/* Copyright 2011-2012 Nicholas J. Kain, licensed under standard MIT license */
setjmp: 
	mov %rbx,(%rdi)         /* rdi is jmp_buf, move registers onto it */
	mov %rbp,8(%rdi)
	mov %r12,16(%rdi)
	mov %r13,24(%rdi)
	mov %r14,32(%rdi)
	mov %r15,40(%rdi)
	lea 8(%rsp),%rdx        /* this is our rsp WITHOUT current ret addr */
	mov %rdx,48(%rdi)
	mov (%rsp),%rdx         /* save return addr ptr for new rip */
	mov %rdx,56(%rdi)
	xor %eax,%eax           /* always return 0 */
	ret

longjmp:
	xor %eax,%eax
	cmp $1,%esi             /* CF = val ? 0 : 1 */
	adc %esi,%eax           /* eax = val + !val */
	mov (%rdi),%rbx         /* rdi is the jmp_buf, restore regs from it */
	mov 8(%rdi),%rbp
	mov 16(%rdi),%r12
	mov 24(%rdi),%r13
	mov 32(%rdi),%r14
	mov 40(%rdi),%r15
	mov 48(%rdi),%rsp
	jmp *56(%rdi)           /* goto saved address without altering rsp */

Let’s first look at setjmp. If you’re not familiar with assembly code, not only do I envy you, I will work towards worsening your life by explaining it. To wit, %rdi corresponds to the register the first (in this case only) parameter the function receives from the C call convention in x86_64. The setjmp gets a pointer to a data structure called the jump buffer. It is not a public interface we get to poke at, but by inspection we can see what it comprises: of RBX (the Roblox™©® register); RBP, the stack frame base pointer register; R12–R15 which are other non-volatile registers; then the RSP stack pointer registers (the lea stuff just removes the bottom of the stack which corresponds to the return address of that setjmp call, then the actual return address, and the RDX register.

The key takeaway is we save a bunch of registers to memory, which we will later restore in longjmp by referring to those values. The important bit is that we may restore the registers and jump to the saved return address, largely making it look like we returned from the original call to setjmp but with EAX set to something other than 0, and therefore returning an indicator that we actually jumped there.

If you squint a bit you might feel like this offers some justification for perverting and mangling C, as well as the means to do so.

My implementation

I haven’t tested it very extensively, but this header-only implementation seems to work-on-my-machine™:

#ifndef CORO2_H
#define CORO2_H

#include <stdbool.h>

void coro_yield();
void coro_spawn(void (*fn)(void*), void*);
bool coro_poll();

#endif
#ifdef	CORO2_IMPLEMENTATION

#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

#define auto_t __auto_type
#define typeof __Typeof
#define apush(vec, val) ({\
	auto_t x = &(val);\
	auto_t v = &(vec);\
	if (v->count == v->capacity) {\
		v->capacity = v->capacity? 2 * v->capacity: 1;\
		v->items = realloc(v->items, sizeof(*x) * v->capacity);\
		if (v->items == NULL)\
			abort();\
	}\
	v->items[v->count++] = *x;\
})

#ifndef CORO_STACK_SIZE
#define CORO_STACK_SIZE (4 << 20 /* 4MB */)
#endif

struct coroutine {
	void *rsp, *base;
	bool dying;
};

struct _global_state {
	struct coroutine* items;
	size_t capacity, count, dying, current;
};

static struct _global_state _ = {0};
static struct coroutine* current(void);
static void   restore(void*);
static void   reap(void);
static void   die(void);
static void   save_rsp(void* rsp);
static void   jmp2next(void);

__attribute__((constructor)) static void coro_init() {
	apush(_, (struct coroutine){0});
}

__attribute__((destructor)) static void coro_fini() {
	free(_.items);
	_ = (struct _global_state) {0};
}

__attribute__((naked)) void coro_yield() {
	__asm__("push %rdi");
	__asm__("push %rbp");
	__asm__("push %rbx");
	__asm__("push %r12");
	__asm__("push %r13");
	__asm__("push %r14");
	__asm__("push %r15");
	__asm__("mov  %rsp, %rdi");
	__asm__("jmp *%0"::"r" (save_rsp));
}

void coro_spawn(void (*fn)(void*), void* arg) {
	static int prot = PROT_READ | PROT_WRITE;
	static int flag = MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN;
	void* base = mmap(NULL, CORO_STACK_SIZE, prot, flag, -1, 0);
	void** rsp = base;
	*(--rsp) = die;
	*(--rsp) = fn;
	*(--rsp) = arg; // %rdi
	*(--rsp) = 0;   // %rbp");
	*(--rsp) = 0;   // %rbx");
	*(--rsp) = 0;   // %r12");
	*(--rsp) = 0;   // %r13");
	*(--rsp) = 0;   // %r14");
	*(--rsp) = 0;   // %r15");	
	apush(_, ((struct coroutine){.rsp = rsp, .base = base}));
}

bool coro_poll() {
	return _.count > 1;
}

void die() {
	if (_.count == 0)
		abort();
	current()->dying = true;
	struct coroutine tmp = *current();
	*current() = _.items[_.count - 1];
	_.items[_.count - 1] = tmp;
	
	_.count -= 1;
	if (_.count) {
		_.dying += 1;
		_.current += 1;
		_.current %= _.count;
		restore(current()->rsp);
	} else {
		// munmap'ing our own stack sounds like an awful idea
		// just exit.
		exit(EXIT_SUCCESS);	
	}
}

__attribute__((naked)) void restore(void* rsp) {
	__asm__ __volatile__(""::"m" (rsp));
	__asm__             ("mov %rdi, %rsp"); (void) rsp; // probably illegal :)
	// we can now reap the dying stacks, as our stack isn't dying
	reap();
	__asm__("pop %r15");
	__asm__("pop %r14");
	__asm__("pop %r13");
	__asm__("pop %r12");
	__asm__("pop %rbx");
	__asm__("pop %rbp");
	__asm__("pop %rdi");
	__asm__("ret");
}

void reap() {
	for (size_t i = _.count; i < _.count + _.dying; i++) {
		munmap(_.items[i].base, CORO_STACK_SIZE);
		_.items[i] = (struct coroutine) {0};
	}
	_.dying = 0;
}

struct coroutine* current() {
	if (_.count == 0)
		abort();
	return &_.items[_.current];
}

void save_rsp(void* rsp) {
	if (_.count == 0)
		abort();
	current()->rsp = rsp;
	jmp2next();
}

void jmp2next(void) {
	if (_.count == 0)
		abort();
	_.current += 1;
	_.current %= _.count;
	restore(current()->rsp);	
}

#undef	CORO2_IMPLEMENTATION
#endif

I haven’t really kept up with how Alexey’s changed his implementation. I remember suggesting to use mmap on the comment section but I don’t know if he ever read it, probably not.

You may find the source and an example on these links. You need to compile with something like

Excuse me, what am I looking at?

The C code itself it rather straight forward. What is a bit esoteric is the swathes of inline assembly, and the fact that depending on very mild changes, everything breaks. I basically got it to work once and everything after that was incremental and hard-fought. A past implementation I attempted to do from scratch for this blog post simply didn’t work and I don’t know exactly what I was doing differently from this implementation; changing things a bit made gcc pretty upset, as it reprimanded me:

which is at the same time VERY reasonable, and quite unfortunate. Since this isn’t a production system, we’ll ignore the fact that the compiler might very well optimize our code into nasal demons etc.

Breaking Down the Alphabet Soup

First thing to do with any library is tracing back its public interfaces to their implementations. Here we have three functions: coro_poll() is called to see if there are other coroutines that could be resumed, or if we are all alone. This is important because we don’t have a runtime to manage those coroutines for us, the way I’ve set it up uses main() to poll the coroutines, as it is the only coroutine that surely will not exit (as if it does, the program still terminates).

The concept of polling is perhaps as bit useless, because the responsibility of yielding execution control to peer routines is distributed, it just so happens main has to transfer said control, which puts it on the list of things that may be resumed, and therefore you kind of always have to coro_yield() so as to not squat the program counter.

The next public function in the scale of complexity is coro_spawn which works like a deferred call to a function. We set up its stack frame as if it had been called but we just never transfer control to it. With one key difference, the stack frame we create is has just been allocated on the heap using mmap, creating a virtual mapping that grows downwards as needed until the virtual memory manager gets fed up (when you encroach on an already mapped page, for instance) and segfaults you.

This means that when we transfer the control to that function, it will be as if it is our main function, at the bottom of its own separate and independent stack. With the same caveat that main has: it isn’t the true bottom of the stack, main is called by _start which does some C runtime stuff before calling main, and then performs the appropriate rites so that the program exits gracefully.

We have the same kind of thing happening, at the bottom of its stack, a coroutine will have a guard die function where its return address ought to be. When/if the routine returns, die will not be so much invoked but rather returned to (even though we never called the routine from die) and it will handle coroutine cleanup and switching to a different one if appropriate.

	*(--rsp) = die;
	*(--rsp) = fn;
	*(--rsp) = arg; // %rdi
	*(--rsp) = 0;   // %rbp");
	*(--rsp) = 0;   // %rbx");
	*(--rsp) = 0;   // %r12");
	*(--rsp) = 0;   // %r13");
	*(--rsp) = 0;   // %r14");
	*(--rsp) = 0;   // %r15");

But why pass all those registers through the stack? Well, it’s because in the process of switching between coroutine contexts, we save (what would normally be callee-saved) on the stack, then save the stack pointer itself on a global list of coroutine structures. When we restore/resume a particular coroutine, it takes from that stack and restores its own context before handing control back to the place that originally yielded it.

You should note that both in coro_yield and restore we do not, in fact, store any information regarding where to return to when we in fact restore the coroutine’s execution. This is because the compiler already takes care of that: when we call coro_yield, the compiler pushes the return address onto the stack, we then do our register preservation thing we partially lifted from setjmp.

But we never return from coro_yield, we jump ie. transfer control to a function that saves the current stack pointer in the current coroutine structure, advances the active/current coroutine, then restores its context. Context restoration involves undoing the register protection we performed when that particular coroutine yielded, at which point the stack is “as if” we had just just entered coro_yield, and the very next instruction is a ret, meaning we do a “return” with exactly the same registers (the ones that the calling convention says that matters, at least) and stack we had just after calling coro_yield.

This means that coro_yield “looks like” it does nothing at all, from the perspective of the function that yielded; it just went to sleep for a while, and just like pausing your conscience for a second, you can only observe it through the side-effects of things you did not observe.

Some more nuanced things is how the dying coroutines mark themselves as dying, and move themselves to “past” the end of the array of coroutine objects so as to not be resumed, when we transfer control to an active coroutine, in restore, we are entitled to reap the dying coroutines, as our coroutine isn’t dying so we won’t run the risk of munmapping our own stack from under our feet.

Caveats

Well, frankly, all of them. This seems to be plainly illegal, if GCC is anything to go by. It works but debugging is essentially impossible aside from actually stepping through the program on gdb and trying to find the exact moment you get an error; if you miss it, you’ll switch stacks and backtracing will literally not work right. gdb simply does not expect you to do something this cursed.

Similarly, valgrind gets very upset and crashes outright because of some hi-lo discrepancy that I don’t presume to understand ever.