Tutorial: Allocation Sinking Optimization

Warning

LuaVela developers disclaimer: This tutorial is a carbon copy of the missing LuaJIT wiki page.

Introduction

The Allocation Sinking and Store Sinking Optimization was added to LuaJIT 2.0 in July 2012. This page documents the rationale for this optimization, the algorithm used and its implementation. The resulting performance is discussed and compared to other languages, too.

The research and the development of this optimization has been sponsored by an anonymous corporate sponsor in 2012.

Although this optimization is based on other well-known techniques, the combination may be considered innovative. I (Mike Pall) hereby donate the related intellectual property to the public domain. If you find it useful, please use it and share it!

Rationale

Avoiding temporary allocations is an important optimization for high-level languages. Allocations are expensive and cause additional overhead since they need to be garbage collected later. The cost can be reduced with considerable effort (cf. the New Garbage Collector), but an eliminated allocation is always the cheapest option.

LuaJIT already eliminates many temporary allocations with multiple techniques: e.g. floating-point numbers aren’t boxed and the JIT compiler eliminates allocations for most immutable objects. However, many allocations remain, especially in OO-heavy workloads.

The traditional techniques to avoid temporary allocations are escape analysis and scalar replacement of aggregates (SRA). Escape analysis verifies that an object doesn’t escape through any code path. But code generated for dynamic languages has many fallback paths, e.g. to handle dynamic typing. Sadly, this means that almost any object inevitably escapes through at least one of these uncommon code paths. This makes the traditional techniques ineffective for dynamic languages.

Motivating Example

For explanatory purposes, the following synthetic example has an explicit code path where the allocation escapes to. Of course, the same underlying logic applies for implicit code paths caused by the dynamic language semantics.

local x, z = 0, nil
for i=1,100 do
  local t = {i}
  if i == 90 then
    z = t
  end
  x = x + t[1]
end
print(x, z[1])

Although the uncommon code path for i == 90 is only taken once, traditional escape analysis would fail, because t does escape to z at one point in the control flow.

[Granted, no sane programmer would write such code. But it may arise indirectly as a consequence of higher language abstractions, e.g. when operating on aggregates. See below for some practical examples.]

Let’s take a look at the IR and the snapshots generated by LuaJIT with the sinking optimization disabled (-O-sink). Only the looping part of the trace is shown:

....        SNAP   #4   [ ---- 0009 ---- 0010 ---- ---- 0010 ]
0012 ------ LOOP ------------
0013 >  tab TNEW   #3    #0
0014    p32 FLOAD  0013  tab.array
0015    p32 AREF   0014  +1
0016    num CONV   0010  num.int
0017    num ASTORE 0015  0016
....        SNAP   #5   [ ---- 0009 ---- 0010 ---- ---- 0010 0013 ]
0018 >  int NE     0010  +90
0019  + num ADD    0016  0009
0020  + int ADD    0010  +1
....        SNAP   #6   [ ---- 0019 ---- ]
0021 >  int LE     0020  +100
0022    int PHI    0010  0020
0023    num PHI    0009  0019

As we can see, the table allocation TNEW in instruction 0013 escapes to the snapshot for side exit #5. Much more interesting is to see that the load t[1] has been eliminated. The stored value i (the left PHI operand at 0010, not shown) has been forwarded through the integer-to-number conversion (0016) and then added to x in instruction 0019. x itself escapes through side exit #6, but the allocation is dead before the next loop iteration.

[Note: the code path starting at side exit #5 is never compiled, since the side exit is not taken often enough. What happens when a side trace is compiled is shown below.]

The solution to the problem discussed above starts with an observation: there are no loads left for the temporary object! Thanks to advanced alias analysis, store-to-load-forwarding in LuaJIT is generally quite effective at removing all loads from temporary objects: they are simply replaced with the corresponding stored values.

The only references to a temporary allocation are then the stores themselves (via the AREF and the FLOAD in the above example). However, these stores are pointless if the allocation is not used later on in the fast code path. The allocation is thus only created for consistency with the other code paths, but performs no real function for the fast path.

The basic idea is now to move the temporary allocation where it’s really needed. This is ‘code motion’ in compiler terminology. Since we want to move it to a side path of the code, a more precise term is ‘sinking’.

Here’s the above example again, but with store-to-load-forwarding and sinking applied by hand on the right side:

You can easily verify that both produce the exact same output. But the IR for the hand-optimized Lua code is very different:

....        SNAP   #4   [ ---- 0005 ---- 0006 ---- ---- 0006 ]
0008 ------ LOOP ------------
....        SNAP   #5   [ ---- 0005 ---- 0006 ---- ---- 0006 ]
0009 >  int NE     0006  +90
0010    num CONV   0006  num.int
0011  + num ADD    0010  0005
0012  + int ADD    0006  +1
....        SNAP   #6   [ ---- 0011 ---- ]
0013 >  int LE     0012  +100
0014    int PHI    0006  0012
0015    num PHI    0005  0011

The allocation and the related store is gone from the fast path of the loop, which means it’ll run much faster!

The Goal

The main innovation of the approach described above is to combine store-to-load-forwarding with store sinking and allocation sinking. This is highly effective in avoiding temporary allocations in the fast paths, even under the presence of many uncommon paths where the temporary object may escape to.

The combination of the two optimizations has the same effect as scalar replacement of aggregates (SRA), but it’s applicable in more contexts. This approach is most effective for dynamic languages, but may be successfully applied elsewhere, when the classic techniques fail.

The goal is to automatically perform the above optimization, whenever feasible and profitable.

Note this optimization completely eliminates the allocation from the fast path. It does not turn a heap allocation into a stack allocation. The stored values are still live, usually in registers and sometimes in spill slots. However, the layout of spill slots does not need to match the layout of an allocated object (nor is it suitable to be passed to a C function expecting such an object).

The Missing Details

The above example is greatly simplified, of course. In practice, an allocation may escape to more than one snapshot or to more than one stack slot in the same snapshot. There may be stores both before and after a snapshot, which may even overwrite each other’s values. And, most importantly, temporary allocations may be held in loop-carried variables, which further complicates the analysis.

Sinking an allocation that escapes to a snapshot means extra work for the data-driven exit handler: the allocation needs to be unsunk, i.e. created on-the-fly and the related stores need to be performed, too.

Similarly, a side trace may need to be compiled corresponding to a snapshot with a sunk allocation: the allocation and the related stores need to be replayed in the side trace. For a truly general solution, a sunk allocation that’s replayed in a side trace may be sunk again, of course.

The same approach needs to work for all types of allocations, not just for Lua tables. Temporary allocations of mutable and immutable FFI objects are common, too.

Examples

The exact algorithm used for this optimization is explained in the implementation section below. For better understanding, this section shows the results of the optimization for a range of examples.

Motivating Example Again

Here’s the motivating example again plus the IR and the machine code for the loop part with the sinking optimization turned on:

local x, z = 0, nil
for i=1,100 do
  local t = {i}
  if i == 90 then
    z = t
  end
  x = x + t[1]
end
print(x, z[1])
....              SNAP   #4   [ ---- 0009 ---- 0010 ---- ---- 0010 ]
0012 ------------ LOOP ------------
0013  {sink}  tab TNEW   #3    #0
0014          p32 FLOAD  0013  tab.array
0015          p32 AREF   0014  +1
0016 xmm6     num CONV   0010  num.int
0017  {0013}  num ASTORE 0015  0016
....              SNAP   #5   [ ---- 0009 ---- 0010 ---- ---- 0010 0013 ]
0018       >  int NE     0010  +90
0019 xmm7   + num ADD    0016  0009
0020 rbp    + int ADD    0010  +1
....              SNAP   #6   [ ---- 0019 ---- ]
0021       >  int LE     0020  +100
0022 rbp      int PHI    0010  0020
0023 xmm7     num PHI    0009  0019
->LOOP:
394cffd0  xorps xmm6, xmm6
394cffd3  cvtsi2sd xmm6, ebp
394cffd7  cmp ebp, +0x5a
394cffda  jz 0x394c0024 ->5
394cffe0  addsd xmm7, xmm6
394cffe4  add ebp, +0x01
394cffe7  cmp ebp, +0x64
394cffea  jle 0x394cffd0 ->LOOP
394cffec  jmp 0x394c0028 ->6

As you can see, the allocation and the store have been sunk. No register or spill slot is allocated to them: it shows {sink} or the reference of the allocation corresponding to the store instead. The machine code for the loop contains only the conversion, the i == 90 check, the summation of x and the loop increment plus boundary check.

[Note: the xorps avoids a partial-register-write stall for the SSE2 conversion instruction.]

Re-Sinking

Here’s an example that shows how a sunk allocation is sunk again in a side trace:

local z = nil
for i=1,200 do
  local t = {i}
  if i > 100 then
    if i == 190 then z = t end
  end
end
print(z[1])

Here’s the IR for the first trace (the } shows sunk instructions):

....        SNAP   #0   [ ---- ]
0001    int SLOAD  #2    CI
0002 }  tab TNEW   #3    #0
0003    p32 FLOAD  0002  tab.array
0004    p32 AREF   0003  +1
0005    num CONV   0001  num.int
0006 }  num ASTORE 0004  0005
....        SNAP   #1   [ ---- ---- 0001 ---- ---- 0001 0002 ]
0007 >  int LE     0001  +100
0008  + int ADD    0001  +1
....        SNAP   #2   [ ---- ---- ]
0009 >  int LE     0008  +200
....        SNAP   #3   [ ---- ---- 0008 ---- ---- 0008 ]
0010 ------ LOOP ------------
0011 }  tab TNEW   #3    #0
0012    p32 FLOAD  0011  tab.array
0013    p32 AREF   0012  +1
0014    num CONV   0008  num.int
0015 }  num ASTORE 0013  0014
....        SNAP   #4   [ ---- ---- 0008 ---- ---- 0008 0011 ]
0016 >  int LE     0008  +100
0017  + int ADD    0008  +1
....        SNAP   #5   [ ---- ---- ]
0018 >  int LE     0017  +200
0019    int PHI    0008  0017

A side trace is spawned, the sunk allocation and the sunk store is replayed before the first snapshot. Then it’s sunk again:

0001    int SLOAD  #2    PI
0002 }  tab TNEW   #3    #0
0003    p32 FLOAD  0002  tab.array
0004    p32 AREF   0003  +1
0005    num CONV   0001  num.int
0006 }  num ASTORE 0004  0005
....        SNAP   #0   [ ---- ---- 0001 ---- ---- 0001 0002 ]
0007 >  nil GCSTEP
....        SNAP   #1   [ ---- ---- 0001 ---- ---- ---- 0002 ]
0008 >  int NE     0001  +190
0009    int ADD    0001  +1
....        SNAP   #2   [ ---- ---- ]
0010 >  int LE     0009  +200
0011    num CONV   0009  num.int
....        SNAP   #3   [ ---- ---- 0011 ---- ---- 0011 ]

[An explicit GCSTEP instruction is needed after the first snapshot to prevent implicit GC steps before the first snapshot. It’s really a no-op here, since no allocations are performed.]

Point Class

Here’s a practical example of a higher language abstraction: a (simplified) ‘point’ class. It needs to create temporary objects as part of arithmetic operations on point objects.

Point Class With Lua Tables

First, the version for plain Lua tables:

local point
point = {
  new = function(self, x, y)
    return setmetatable({x=x, y=y}, self)
  end,
  __add = function(a, b)
   return point:new(a.x + b.x, a.y + b.y)
  end,
}
point.__index = point
local a, b = point:new(1.5, 2.5), point:new(3.25, 4.75)
for i=1,100000000 do a = (a + b) + b end
print(a.x, a.y)

It creates two temporary objects per iteration, but LuaJIT is able to optimize them away. Here’s the machine code for the loop part:

->LOOP:
394cffe0  addsd xmm6, xmm1
394cffe4  addsd xmm7, xmm0
394cffe8  addsd xmm6, xmm1
394cffec  addsd xmm7, xmm0
394cfff0  add ebp, +0x01
394cfff3  cmp ebp, 0x05f5e100
394cfff9  jle 0x394cffe0 ->LOOP
394cfffb  jmp 0x394c0028 ->6

[Note: floating-point arithmetic is not associative. The compiler is not allowed to transform (a+b)+b into a+(b+b) and then hoist out the computation of b+b. It really needs two FP additions per component for a total of four SSE2 addsd instructions per loop iteration.]

Point Class With FFI cdata struct

local ffi = require("ffi")
local point
point = ffi.metatype("struct { double x, y; }", {
  __add = function(a, b)
   return point(a.x + b.x, a.y + b.y)
  end
})
local a, b = point(1.5, 2.5), point(3.25, 4.75)
for i=1,100000000 do a = (a + b) + b end
print(a.x, a.y)

The machine code inside the loop is identical to the above example. A similar example with immutable ‘complex’ cdata objects generates the same code, too.

Point Class in C++ and Java

For comparison, here’s the same point class implemented in C++:

#include <iostream>

class Point {
public:
  Point(double x, double y) : x(x), y(y) { }
  Point operator+(const Point &b) const {
    return Point(x+b.x, y+b.y);
  }
  double x, y;
};

int main()
{
  int i;
  Point a(1.5, 2.5), b(3.25, 4.75);
  for (i = 0; i < 100000000; i++)
    a = (a + b) + b;
  std::cout << a.x << ' ' << a.y << '\n';
  return 0;
}

Note that this code is passing aggregate values around and not actually allocating objects. However, the C++ compiler still has to perform SRA for optimal results. Not surprisingly, the machine code for the loop is basically the same as the code LuaJIT generates.

[We’ll ignore auto-vectorization/auto-simdization for the moment.]

And here’s the same point class implemented in Java:

public class Point {
  public double x, y;
  public Point(double x0, double y0) { x = x0; y = y0; }
  public Point add(Point b) { return new Point(x + b.x, y + b.y); }

  public static void main(String args[])
  {
    int i;
    Point a = new Point(1.5, 2.5);
    Point b = new Point(2.25, 4.75);
    for (i = 0; i < 100000000; i++)
      a = a.add(b).add(b);
    System.out.println(a.x + " " + a.y);
  }
}

JVM/Hotspot 1.7 is unable to eliminate the allocations. Adding the option -XX:+DoEscapeAnalysis doesn’t change anything. Moving the loop to a separate method or using an outer loop doesn’t help either.

Point Class Benchmarks

Here’s the runtime for the point class in seconds (YMMV). Lower is better:

Time Point object VM/Compiler
140.0 Lua table Lua 5.1.5
26.9 Lua table LuaJIT 2.0 git HEAD -O-sink
10.9 FFI struct LuaJIT 2.0 git HEAD -O-sink
0.2 Lua table LuaJIT 2.0 git HEAD -O+sink
0.2 FFI struct LuaJIT 2.0 git HEAD -O+sink
0.2 C++ class GCC 4.4.3 -O2 (or -O3)
1.2 Java class JVM/Hotspot 1.7.0_05

LuaJIT is around 700 times faster than plain Lua and it’s the same speed as C++ – for this example.

JVM 1.7 is unable to eliminate the allocations, but has a fast allocator and garbage collector. Still, the JVM is around 6 times slower than LuaJIT or C++ for this example.

Note: this cannot be extrapolated to other code, of course. And it does not lead to a generalizable statement about the relative performance of Lua, LuaJIT, Java or C++.

[Yes, a runtime of 0.2 seconds is too low for precise results. But for this simple example, scaling the loop iterations up yields consistent results. The code generated by LuaJIT and C++ really has the same performance.]

Implementation

Different parts of the JIT-compiler are involved in handling allocation sinking and store sinking. The following subsections describe the required additions and changes to the implementation.

SINK Optimization Pass

The SINK optimization pass is only invoked, if:

  1. The SINK optimization and the FOLD, CSE, DCE and FWD optimizations are turned on. This is the case for -O3, which is the default.
  2. The IR of the current trace holds at least one TNEW, TDUP, CNEW or CNEWI instruction.

This pass traverses the IR of the current trace in two phases, very much like a classic mark & sweep garbage collection algorithm:

  • The mark phase marks all allocations that cannot be sunk.
  • The sweep phase tags the unmarked allocations and the related stores as sunk.

Mark Phase

The mark phase is a single-pass backward propagation algorithm. Initially, the marks of all IR instructions are clear. Then the roots are marked (most of this happens on-the-fly) and the marks are propagated backwards. The following roots are considered:

  1. All instructions referenced by the last snapshot of a trace, if the trace links to another trace.
  2. All instructions referenced by PHI instructions that do not reference an allocation instruction or that reference two different opcodes.
  3. The references for any remaining load or CALLL instructions.
  4. The arguments of any IR CALLS or CALLXS instructions.
  5. The stored value of all store instructions and CNEWI.
  6. The references for any ineligible store instructions.

A store instruction is only eligible, if all of these conditions are true:

  1. The store reference is a single instruction. Plus a fetch of the base pointer for some reference types.
  2. The store reference has a constant key, index or offset.
  3. The store reference points to an allocation.
  4. If the allocation is a PHI value, then the stored value must either be a PHI value itself (or an integer-to-number conversion of it) or a loop-invariant value.

Finally, the PHI references are iteratively remarked. If the left side and the right side have different marks or a different number of PHI values stored, then both sides are marked. This is repeated until the marks converge.

After this pass is complete, the allocations that are unmarked are considered safe to be sunk.

Sweep Phase

The sweep phase is a single-pass through the IR instructions:

  1. A PHI instruction is tagged as sunk if it refers to unmarked allocations (it’s enough to check one side).
  2. A store or a NEWREF is tagged as sunk if the corresponding allocation is unmarked.
  3. An allocation is tagged as sunk if it’s unmarked.

All marks are cleared, too. Tagging works as follows:

  • The IR is tagged to contain valid register/spill slot assignments.
  • All untagged instructions are set to REGSP_INIT (no register or spill slot allocated).
  • Sunk allocations, NEWREF and stores have the register field set to RID_SINK (treated like no register has been allocated to it).
  • Sunk allocations and NEWREF have a zero spill slot field.
  • Sunk stores have a spill slot field that holds the delta between the store and the corresponding allocation (or 255 if the delta is too large). This can be used as a quick test when traversing the IR in search of related stores (see below).

Assembler Backend

The assembler backend translates the IR to machine code. It needs to know about sunk instructions to avoid generating machine code for them.

Snapshot Allocations

The first encountered guard that needs to add an exit for a snapshot calls a special routine that prepares the current snapshot. Every instruction producing an escaping value must have a register or spill slot allocated to it if it doesn’t have one already.

This ensures that all values escaping through the snapshot are live at all exits. Processing is backwards, so any value live at the first encountered guard (highest IR instruction) is live at all other guards, up to the snapshot.

The snapshot preparation checks for sunk allocations and allocates the stored values instead. This involves searching for all stores related to an allocation. Only the store instructions between the allocation reference and the snapshot reference need to be considered.

Once this is done, the register for an allocation is changed from RID_SINK to RID_SUNK. This prevents extra invocations in case the allocation escapes to more than one stack slot or through another snapshot further up.

Additional logic tries to allocate a value for the input of an integer-to-number conversion instead of the conversion itself. This effectively sinks conversions and removes them from the fast path.

Sunk Instructions

A sunk allocation instruction has its register set to RID_SINK or RID_SUNK and a zero (unused) spill slot. The result is thus considered unused by on-the-fly dead-code elimination (DCE). No code is emitted for a sunk allocation.

Unused, sunk conversions are eliminated by DCE and no code is emitted for them.

Store instructions and the NEWREF instruction are not eligible for DCE, so they need an explicit check for RID_SINK in their backend handler to not emit any code.

Sunk PHI instructions need the same check to avoid allocating a register or spill slot for the referenced allocations.

Snapshot Handling

Snapshots hold the execution state from the view of the bytecode interpreter at selected points in a trace. Sunk allocations and sunk conversions are not actually executed on-trace and thus do not generate a value. However, the stored values (or input values for conversions) are computed and can be used to restore or replay a snapshot.

Snapshot Restore

A taken side exit that has no attached side trace (yet) needs to restore the Lua stack to a sane state. The snapshot corresponding to the exit has a list of Lua stack slots to be restored and their IR references. These can be used together with the register and spill slot assignments in the IR to restore the Lua stack from the exit state of the trace, which holds the current register and spill slot values.

Sunk allocations (that escape through a snapshot) have their register set to RID_SUNK. The allocation needs to be ‘unsunk’: the object needs to be allocated and the related stores need to be performed, as if the object had been allocated on-trace.

The snapshot allocation in the assembler backend makes sure that all sunk stores have live values at the exit. The actual values are restored via their IR reference and the exit state, too. Keys of sunk stores are always constant, so they can be reconstructed solely from the IR.

Care needs to be taken to de-duplicate allocation references. A sunk allocation may escape to more than one stack slot and all of them must refer to the same object.

Snapshot Replay

The IR for a side trace needs to start with the instructions that link to the parent values included in the snapshot. The SLOAD instruction with the flag IRSLOAD_PARENT provides parent links for values that correspond to a Lua stack slot.

Sunk allocations, sunk stores and sunk conversions need to be replayed instead. Sunk stores and conversions may reference values from the parent that do not have a corresponding Lua stack slot. The PVAL instruction provides the necessary parent links.

All parent links must be at the start of the trace, since their registers and stack slots must be coalesced with the parent trace atomically. Also, a PVAL must not be used if there’s a parent SLOAD for the same reference. This causes some ordering constraints, which mandates a multi-pass algorithm:

  1. Emit all parent SLOAD instructions for non-sunk slots from the snapshot.
  2. Emit PVAL instructions for all dependent values from sunk allocations or conversions, except if there’s a parent SLOAD instruction for the same reference.
  3. Replay the sunk instructions. Allocations need to replay the key references for the sunk stores and the sunk stores themselves, too.

[Note: this sounds more expensive, than it is. This algorithm iterates a limited number of times – usually less than five. Also, passes two and three are only performed when needed.]