Tutorial: Memory Layout for Machine Code and Exit Stubs

Introduction

This article explains some details of the machine code (mcode) memory management performed by the LuaVela compiler.

Memory Area for mcode: Overview

Each time the platform detects a hot execution path in the Lua code, it tries to compile it. Such a path is called a trace. Compilation consists of translating the Lua bytecode of the trace into an intermediate representation (IR; this phase is also known as “trace recording”), running optimizations on the generated IR, and finally assembling the IR, i.e. transforming it into actual machine code, which is the ultimate goal of all JIT compilation efforts.

All machine code generated by LuaVela is kept in jit_State.mcarea. mcarea is a singly linked list of memory chunks, each aligned to LJ_PAGESIZE. Each chunk has the following structure:

+---------------------------+ <-- lowest address
|link to the next area chunk|
+---------------------------+
|mcbot: area bottom         |
+---------------------------+
|##### area red zone #######|
+---------------------------+
|///////////////////////////|
|///////free space//////////|
|///////////////////////////|
+---------------------------+
|mctop: area top            |
+---------------------------+
|\\\possible padding crap\\\|
+---------------------------+ <-- highest address

mcbot and mctop grow towards each other: mcbot moves from lower to higher addresses, and mctop moves from higher to lower addresses.
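
The two-ended scheme can be sketched in C as follows. This is only an illustrative model, not LuaVela's actual code: the SketchArea type, the helper names and the 64-byte red zone size are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one mcarea chunk; not the real LuaVela layout. */
enum { SKETCH_REDZONE = 64 };  /* assumed red zone size, in bytes */

typedef struct SketchArea {
  uint8_t *mcbot;  /* first free byte above the trace-independent code */
  uint8_t *mctop;  /* lowest byte occupied by machine code of traces */
} SketchArea;

/* Bytes that can still be handed out without touching the red zone. */
static size_t sketch_free(const SketchArea *a)
{
  size_t gap = (size_t)(a->mctop - a->mcbot);
  return gap > SKETCH_REDZONE ? gap - SKETCH_REDZONE : 0;
}

/* Reserve n bytes for trace code: mctop moves towards lower addresses. */
static uint8_t *sketch_reserve_top(SketchArea *a, size_t n)
{
  if (n > sketch_free(a)) return NULL;
  a->mctop -= n;
  return a->mctop;
}

/* Reserve n bytes for stubs: mcbot moves towards higher addresses. */
static uint8_t *sketch_reserve_bot(SketchArea *a, size_t n)
{
  uint8_t *p = a->mcbot;
  if (n > sketch_free(a)) return NULL;
  a->mcbot += n;
  return p;
}
```

Both reservations draw from the same free space, so either end can use whatever the other has left.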

Memory at mctop and higher is occupied by the machine code of traces:

+---------------------------+
|\\\\\\lower addresses\\\\\\|
+---------------------------+
|mctop                      |
+---------------------------+ <-- trace2->mcode
|/////mcode for trace2//////|
+---------------------------+ <-- trace1->mcode
|/////mcode for trace1//////|
+---------------------------+

Each time a new trace is about to be assembled, its array of IR instructions is processed from bottom to top, and the output machine code is copied to mcarea, shifting mctop towards lower addresses. After that, the processed trace, which originally resided partly in jit_State.cur and partly in other jit_State buffers (e.g. jit_State.irbuf), is compacted into a separate GCtrace object. The mcode pointer of the new GCtrace object, however, still points to memory owned by jit_State.
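
One way to picture this is the sketch below, which walks a list of already-encoded instructions from last to first and copies each one just below the current top, so that the finished code ends up in forward order in memory. It assumes that “bottom to top” means last instruction first; the SketchInsn type and the function name are hypothetical and do not come from LuaVela.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One already-encoded instruction (hypothetical helper type). */
typedef struct SketchInsn {
  const uint8_t *bytes;
  size_t len;
} SketchInsn;

/*
 * Walk the instructions from last to first and copy each one just below
 * the current top. Returns the new top, i.e. the address of the first
 * instruction, or NULL if the code does not fit above `limit`.
 */
static uint8_t *sketch_emit_backwards(uint8_t *top, uint8_t *limit,
                                      const SketchInsn *insns, size_t n)
{
  while (n-- > 0) {
    size_t len = insns[n].len;
    if ((size_t)(top - limit) < len) return NULL;
    top -= len;
    memcpy(top, insns[n].bytes, len);
  }
  return top;
}
```

The returned pointer plays the role of both the new mctop and the start of the trace's machine code.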

Memory at mcbot and lower is occupied by trace-independent machine code, which on x86-64 consists of exit stubs only. Exit stubs are divided into groups, and the addresses of the groups are kept in jit_State.exitstubgroup.

Group 0 holds exit stubs 0..31, group 1 holds exit stubs 32..63, etc. Each group has the following structure:

exit_stub_0:
push i8                   ; 2 bytes: 1 for pushi8 opcode + 1 for immediate value
jmp exit_handler_prologue ; 2 bytes: 1 for short jump    + 1 for 8-bit relative offset
exit_stub_1:              ; ...
push i8
jmp exit_handler_prologue
...
exit_handler_prologue:
push i8
mov r11, i64              ; This i64 is the address of the dispatch table
mov [rsp+0x10], r11
mov r11, @lj_vm_exit_handler
jmp r11
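
Given this layout, the address of the stub for a particular exit follows from simple arithmetic: pick the group, then step over 4-byte stubs inside it. The C sketch below illustrates this. The helper name and the plain array of group base addresses are assumptions made for the example, not the actual LuaVela interface.

```c
#include <stdint.h>

enum {
  SKETCH_STUBS_PER_GROUP = 32,  /* from the text: group 0 holds stubs 0..31 */
  SKETCH_STUB_SIZE = 4          /* push i8 (2 bytes) + short jmp (2 bytes) */
};

/* Address of the stub for exit `exitno`, given the group base addresses. */
static const uint8_t *sketch_stub_addr(const uint8_t *const *groups,
                                       unsigned exitno)
{
  unsigned group = exitno / SKETCH_STUBS_PER_GROUP;
  unsigned index = exitno % SKETCH_STUBS_PER_GROUP;
  return groups[group] + SKETCH_STUB_SIZE * index;
}
```

For example, exit 35 lives in group 1 at byte offset 12 from the start of that group.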

The maximum number of stubs per group is limited because each stub reaches the prologue with a short jump (“jmp exit_handler_prologue”), whose relative offset must fit in 1 byte. Stub groups can be made trace-independent because whenever a new trace starts execution, it records its ID in global_State.vmstate. So once we know the location of global_State (and we do because it is saved in exit_handler_prologue, see below), we can treat trace IDs and exit numbers independently.
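
The size limit can be checked with a small calculation. The C sketch below assumes the layout exactly as listed above: every stub takes 4 bytes and ends with a short jump, and the prologue directly follows the last stub. A short jump on x86-64 carries a signed 8-bit displacement measured from the end of the jump instruction, so the farthest stub (stub 0) decides whether a group is encodable. The function names are hypothetical.

```c
#include <stdint.h>

enum { SKETCH_STUB_BYTES = 4 };  /* push i8 (2 bytes) + jmp rel8 (2 bytes) */

/*
 * Displacement for the short jmp of stub `i` in a group of `nstubs`,
 * measured from the end of the jmp to the prologue that directly
 * follows the last stub.
 */
static int sketch_jmp_disp(int i, int nstubs)
{
  int jmp_end = SKETCH_STUB_BYTES * (i + 1);
  int prologue = SKETCH_STUB_BYTES * nstubs;
  return prologue - jmp_end;
}

/* A group is encodable if the farthest jump (stub 0) fits a signed byte. */
static int sketch_group_fits_rel8(int nstubs)
{
  return sketch_jmp_disp(0, nstubs) <= INT8_MAX;
}
```

Under these assumptions a group of 32 stubs needs a displacement of 124 for stub 0, which fits, while a group of 33 would need 128, which exceeds the maximum of 127.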

Stub groups are written to mcarea “on demand”: when we assemble a trace with n exits, we create as many groups as are needed to hold n exits. When we assemble the next trace with m <= n exits, nothing is done, as the code is trace-independent and already assembled.

If m > n, we add the group(s) needed to hold the m - n additional stubs (stubs n, n + 1, ..., m - 1).
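
The on-demand bookkeeping amounts to a ceiling division. The C sketch below assumes that groups are always written whole (32 stubs at a time), as the group listing suggests; the function names are hypothetical.

```c
enum { SKETCH_GROUP_SIZE = 32 };  /* stubs per group, as stated in the text */

/* Number of whole groups needed to cover `nexits` exit stubs. */
static unsigned sketch_groups_for(unsigned nexits)
{
  return (nexits + SKETCH_GROUP_SIZE - 1) / SKETCH_GROUP_SIZE;
}

/*
 * Number of groups to add when stubs for `have_exits` exits already
 * exist and the next trace needs `want_exits` exits.
 */
static unsigned sketch_groups_to_add(unsigned have_exits, unsigned want_exits)
{
  unsigned have = sketch_groups_for(have_exits);
  unsigned want = sketch_groups_for(want_exits);
  return want > have ? want - have : 0;
}
```

Under this assumption, a trace with more exits than any previous one does not always require new groups: after a trace with 40 exits, two groups exist and already cover exits 0..63.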

The code in lj_vm_exit_handler restores the exit number and the trace number, pushes registers onto the stack to form an ExitState struct, and eventually calls

call extern lj_trace_exit(jit_State *J, ExitState *ex)

To restore the trace number, the dispatch table address is needed (its value is saved by exit_handler_prologue). This is implemented as follows:

lea rbp, [rsp+88] ; the magic offset accounts for the registers pushed onto the stack
; -- now rbp points at the slot which was addressed as [rsp+0x10]
; in exit_handler_prologue