.. _tut-allocation-sinking: Tutorial: Allocation Sinking Optimization ========================================= .. warning:: |PROJECT| developers disclaimer: This tutorial is a **carbon copy** of the missing LuaJIT wiki page. .. contents:: :local: Introduction ------------ The **Allocation Sinking and Store Sinking Optimization** was added to LuaJIT 2.0 in July 2012. This page documents the rationale for this optimization, the algorithm used and its implementation. The resulting performance is discussed and compared to other languages, too. .. _sponsor: http://luajit.org/sponsors.html .. |sponsor| replace:: sponsor The research and the development of this optimization has been sponsored by an anonymous corporate |sponsor|_ in 2012. Although this optimization is based on other well-known techniques, the combination may be considered innovative. I (Mike Pall) hereby donate the related intellectual property to the public domain. If you find it useful, please use it and share it! Rationale ^^^^^^^^^ Avoiding temporary allocations is an important optimization for high-level languages. Allocations are expensive and cause additional overhead since they need to be garbage collected later. The cost can be reduced with considerable effort (cf. the :ref:`New Garbage Collector `), but an eliminated allocation is always the cheapest option. LuaJIT already eliminates many temporary allocations with multiple techniques: e.g. floating-point numbers aren't boxed and the JIT compiler eliminates allocations for most immutable objects. However, many allocations remain, especially in OO-heavy workloads. The traditional techniques to avoid temporary allocations are escape analysis and scalar replacement of aggregates (SRA). Escape analysis verifies that an object doesn't escape through any code path. But code generated for dynamic languages has many fallback paths, e.g. to handle dynamic typing. Sadly, this means that almost any object inevitably escapes through at least one of these uncommon code paths. This makes the traditional techniques ineffective for dynamic languages. Motivating Example ^^^^^^^^^^^^^^^^^^ For explanatory purposes, the following synthetic example has an explicit code path where the allocation escapes to. Of course, the same underlying logic applies for implicit code paths caused by the dynamic language semantics. .. code-block:: lua local x, z = 0, nil for i=1,100 do local t = {i} if i == 90 then z = t end x = x + t[1] end print(x, z[1]) Although the uncommon code path for ``i == 90`` is only taken once, traditional escape analysis would fail, because ``t`` **does escape** to ``z`` at one point in the control flow. [Granted, no sane programmer would write such code. But it may arise indirectly as a consequence of higher language abstractions, e.g. when operating on aggregates. See below for some practical examples.] Let's take a look at the IR and the snapshots generated by LuaJIT with the sinking optimization disabled (``-O-sink``). Only the looping part of the trace is shown: .. code:: .... SNAP #4 [ ---- 0009 ---- 0010 ---- ---- 0010 ] 0012 ------ LOOP ------------ 0013 > tab TNEW #3 #0 0014 p32 FLOAD 0013 tab.array 0015 p32 AREF 0014 +1 0016 num CONV 0010 num.int 0017 num ASTORE 0015 0016 .... SNAP #5 [ ---- 0009 ---- 0010 ---- ---- 0010 0013 ] 0018 > int NE 0010 +90 0019 + num ADD 0016 0009 0020 + int ADD 0010 +1 .... SNAP #6 [ ---- 0019 ---- ] 0021 > int LE 0020 +100 0022 int PHI 0010 0020 0023 num PHI 0009 0019 As we can see, the table allocation ``TNEW`` in instruction 0013 escapes to the snapshot for side exit #5. Much more interesting is to see that the load ``t[1]`` has been eliminated. The stored value ``i`` (the left PHI operand at 0010, not shown) has been forwarded through the integer-to-number conversion (0016) and then added to ``x`` in instruction 0019. ``x`` itself escapes through side exit #6, but the allocation is dead before the next loop iteration. [Note: the code path starting at side exit #5 is never compiled, since the side exit is not taken often enough. What happens when a side trace is compiled is shown below.] The solution to the problem discussed above starts with an observation: there are no loads left for the temporary object! Thanks to advanced alias analysis, store-to-load-forwarding in LuaJIT is generally quite effective at removing all loads from temporary objects: they are simply replaced with the corresponding stored values. The only references to a temporary allocation are then the stores themselves (via the ``AREF`` and the ``FLOAD`` in the above example). However, these stores are pointless if the allocation is not used later on in the fast code path. The allocation is thus only created for consistency with the other code paths, but performs no real function for the fast path. The basic idea is now to move the temporary allocation where it's really needed. This is 'code motion' in compiler terminology. Since we want to move it to a side path of the code, a more precise term is 'sinking'. Here's the above example again, but with store-to-load-forwarding and sinking applied by hand on the right side: .. code-block:: local x, z = 0, nil | local x, z = 0, nil for i=1,100 do | for i=1,100 do local t = {i} | ---. if i == 90 then | if i == 90 then | | local t = {i} <--´ z = t | z = t | end | end | x = x + t[1] | x = x + i <--´ end | end print(x, z[1]) | print(x, z[1]) You can easily verify that both produce the exact same output. But the IR for the hand-optimized Lua code is very different: .. code:: .... SNAP #4 [ ---- 0005 ---- 0006 ---- ---- 0006 ] 0008 ------ LOOP ------------ .... SNAP #5 [ ---- 0005 ---- 0006 ---- ---- 0006 ] 0009 > int NE 0006 +90 0010 num CONV 0006 num.int 0011 + num ADD 0010 0005 0012 + int ADD 0006 +1 .... SNAP #6 [ ---- 0011 ---- ] 0013 > int LE 0012 +100 0014 int PHI 0006 0012 0015 num PHI 0005 0011 The allocation and the related store is gone from the fast path of the loop, which means it'll run much faster! The Goal ^^^^^^^^ The main innovation of the approach described above is to combine store-to-load-forwarding with store sinking and allocation sinking. This is highly effective in avoiding temporary allocations in the fast paths, even under the presence of many uncommon paths where the temporary object may escape to. The combination of the two optimizations has the same effect as scalar replacement of aggregates (SRA), but it's applicable in more contexts. This approach is most effective for dynamic languages, but may be successfully applied elsewhere, when the classic techniques fail. The goal is to automatically perform the above optimization, whenever feasible and profitable. Note this optimization completely *eliminates* the allocation from the fast path. It does not turn a heap allocation into a stack allocation. The stored values are still live, usually in registers and sometimes in spill slots. However, the layout of spill slots does not need to match the layout of an allocated object (nor is it suitable to be passed to a C function expecting such an object). The Missing Details ^^^^^^^^^^^^^^^^^^^ The above example is greatly simplified, of course. In practice, an allocation may escape to more than one snapshot or to more than one stack slot in the same snapshot. There may be stores both before and after a snapshot, which may even overwrite each other's values. And, most importantly, temporary allocations may be held in loop-carried variables, which further complicates the analysis. Sinking an allocation that escapes to a snapshot means extra work for the data-driven exit handler: the allocation needs to be unsunk, i.e. created on-the-fly and the related stores need to be performed, too. Similarly, a side trace may need to be compiled corresponding to a snapshot with a sunk allocation: the allocation and the related stores need to be replayed in the side trace. For a truly general solution, a sunk allocation that's replayed in a side trace may be sunk again, of course. The same approach needs to work for all types of allocations, not just for Lua tables. Temporary allocations of mutable and immutable FFI objects are common, too. Examples -------- The exact algorithm used for this optimization is explained in the implementation section below. For better understanding, this section shows the results of the optimization for a range of examples. Motivating Example Again ^^^^^^^^^^^^^^^^^^^^^^^^ Here's the motivating example again plus the IR and the machine code for the loop part with the sinking optimization turned on: .. code-block:: lua local x, z = 0, nil for i=1,100 do local t = {i} if i == 90 then z = t end x = x + t[1] end print(x, z[1]) .. code:: .... SNAP #4 [ ---- 0009 ---- 0010 ---- ---- 0010 ] 0012 ------------ LOOP ------------ 0013 {sink} tab TNEW #3 #0 0014 p32 FLOAD 0013 tab.array 0015 p32 AREF 0014 +1 0016 xmm6 num CONV 0010 num.int 0017 {0013} num ASTORE 0015 0016 .... SNAP #5 [ ---- 0009 ---- 0010 ---- ---- 0010 0013 ] 0018 > int NE 0010 +90 0019 xmm7 + num ADD 0016 0009 0020 rbp + int ADD 0010 +1 .... SNAP #6 [ ---- 0019 ---- ] 0021 > int LE 0020 +100 0022 rbp int PHI 0010 0020 0023 xmm7 num PHI 0009 0019 .. code:: ->LOOP: 394cffd0 xorps xmm6, xmm6 394cffd3 cvtsi2sd xmm6, ebp 394cffd7 cmp ebp, +0x5a 394cffda jz 0x394c0024 ->5 394cffe0 addsd xmm7, xmm6 394cffe4 add ebp, +0x01 394cffe7 cmp ebp, +0x64 394cffea jle 0x394cffd0 ->LOOP 394cffec jmp 0x394c0028 ->6 As you can see, the allocation and the store have been sunk. No register or spill slot is allocated to them: it shows ``{sink}`` or the reference of the allocation corresponding to the store instead. The machine code for the loop contains only the conversion, the ``i == 90`` check, the summation of ``x`` and the loop increment plus boundary check. [Note: the ``xorps`` avoids a partial-register-write stall for the SSE2 conversion instruction.] Re-Sinking ^^^^^^^^^^ Here's an example that shows how a sunk allocation is sunk again in a side trace: .. code-block:: lua local z = nil for i=1,200 do local t = {i} if i > 100 then if i == 190 then z = t end end end print(z[1]) Here's the IR for the first trace (the ``}`` shows sunk instructions): .. code:: .... SNAP #0 [ ---- ] 0001 int SLOAD #2 CI 0002 } tab TNEW #3 #0 0003 p32 FLOAD 0002 tab.array 0004 p32 AREF 0003 +1 0005 num CONV 0001 num.int 0006 } num ASTORE 0004 0005 .... SNAP #1 [ ---- ---- 0001 ---- ---- 0001 0002 ] 0007 > int LE 0001 +100 0008 + int ADD 0001 +1 .... SNAP #2 [ ---- ---- ] 0009 > int LE 0008 +200 .... SNAP #3 [ ---- ---- 0008 ---- ---- 0008 ] 0010 ------ LOOP ------------ 0011 } tab TNEW #3 #0 0012 p32 FLOAD 0011 tab.array 0013 p32 AREF 0012 +1 0014 num CONV 0008 num.int 0015 } num ASTORE 0013 0014 .... SNAP #4 [ ---- ---- 0008 ---- ---- 0008 0011 ] 0016 > int LE 0008 +100 0017 + int ADD 0008 +1 .... SNAP #5 [ ---- ---- ] 0018 > int LE 0017 +200 0019 int PHI 0008 0017 A side trace is spawned, the sunk allocation and the sunk store is replayed before the first snapshot. Then it's sunk again: .. code:: 0001 int SLOAD #2 PI 0002 } tab TNEW #3 #0 0003 p32 FLOAD 0002 tab.array 0004 p32 AREF 0003 +1 0005 num CONV 0001 num.int 0006 } num ASTORE 0004 0005 .... SNAP #0 [ ---- ---- 0001 ---- ---- 0001 0002 ] 0007 > nil GCSTEP .... SNAP #1 [ ---- ---- 0001 ---- ---- ---- 0002 ] 0008 > int NE 0001 +190 0009 int ADD 0001 +1 .... SNAP #2 [ ---- ---- ] 0010 > int LE 0009 +200 0011 num CONV 0009 num.int .... SNAP #3 [ ---- ---- 0011 ---- ---- 0011 ] [An explicit ``GCSTEP`` instruction is needed after the first snapshot to prevent implicit GC steps before the first snapshot. It's really a no-op here, since no allocations are performed.] Point Class ^^^^^^^^^^^ Here's a practical example of a higher language abstraction: a (simplified) 'point' class. It needs to create temporary objects as part of arithmetic operations on point objects. Point Class With Lua Tables """"""""""""""""""""""""""" First, the version for plain Lua tables: .. code-block:: lua local point point = { new = function(self, x, y) return setmetatable({x=x, y=y}, self) end, __add = function(a, b) return point:new(a.x + b.x, a.y + b.y) end, } point.__index = point local a, b = point:new(1.5, 2.5), point:new(3.25, 4.75) for i=1,100000000 do a = (a + b) + b end print(a.x, a.y) It creates two temporary objects per iteration, but LuaJIT is able to optimize them away. Here's the machine code for the loop part: .. code:: ->LOOP: 394cffe0 addsd xmm6, xmm1 394cffe4 addsd xmm7, xmm0 394cffe8 addsd xmm6, xmm1 394cffec addsd xmm7, xmm0 394cfff0 add ebp, +0x01 394cfff3 cmp ebp, 0x05f5e100 394cfff9 jle 0x394cffe0 ->LOOP 394cfffb jmp 0x394c0028 ->6 [Note: floating-point arithmetic is not associative. The compiler is not allowed to transform ``(a+b)+b`` into ``a+(b+b)`` and then hoist out the computation of ``b+b``. It really needs two FP additions per component for a total of four SSE2 ``addsd`` instructions per loop iteration.] Point Class With FFI cdata struct """"""""""""""""""""""""""""""""" .. code-block:: lua local ffi = require("ffi") local point point = ffi.metatype("struct { double x, y; }", { __add = function(a, b) return point(a.x + b.x, a.y + b.y) end }) local a, b = point(1.5, 2.5), point(3.25, 4.75) for i=1,100000000 do a = (a + b) + b end print(a.x, a.y) The machine code inside the loop is identical to the above example. A similar example with immutable 'complex' cdata objects generates the same code, too. Point Class in C++ and Java """"""""""""""""""""""""""" For comparison, here's the same point class implemented in C++: .. code-block:: c++ #include class Point { public: Point(double x, double y) : x(x), y(y) { } Point operator+(const Point &b) const { return Point(x+b.x, y+b.y); } double x, y; }; int main() { int i; Point a(1.5, 2.5), b(3.25, 4.75); for (i = 0; i < 100000000; i++) a = (a + b) + b; std::cout << a.x << ' ' << a.y << '\n'; return 0; } Note that this code is passing aggregate values around and not actually allocating objects. However, the C++ compiler still has to perform SRA for optimal results. Not surprisingly, the machine code for the loop is basically the same as the code LuaJIT generates. [We'll ignore auto-vectorization/auto-simdization for the moment.] And here's the same point class implemented in Java: .. code-block:: java public class Point { public double x, y; public Point(double x0, double y0) { x = x0; y = y0; } public Point add(Point b) { return new Point(x + b.x, y + b.y); } public static void main(String args[]) { int i; Point a = new Point(1.5, 2.5); Point b = new Point(2.25, 4.75); for (i = 0; i < 100000000; i++) a = a.add(b).add(b); System.out.println(a.x + " " + a.y); } } JVM/Hotspot 1.7 is unable to eliminate the allocations. Adding the option ``-XX:+DoEscapeAnalysis`` doesn't change anything. Moving the loop to a separate method or using an outer loop doesn't help either. Point Class Benchmarks """""""""""""""""""""" Here's the runtime for the point class in seconds (YMMV). Lower is better: .. container:: table-wrap ====== ============ =========================== Time Point object VM/Compiler ====== ============ =========================== 140.0 Lua table Lua 5.1.5 26.9 Lua table LuaJIT 2.0 git HEAD -O-sink 10.9 FFI struct LuaJIT 2.0 git HEAD -O-sink 0.2 Lua table LuaJIT 2.0 git HEAD -O+sink 0.2 FFI struct LuaJIT 2.0 git HEAD -O+sink 0.2 C++ class GCC 4.4.3 -O2 (or -O3) 1.2 Java class JVM/Hotspot 1.7.0_05 ====== ============ =========================== LuaJIT is around 700 times faster than plain Lua and it's the same speed as C++ -- **for this example**. JVM 1.7 is unable to eliminate the allocations, but has a fast allocator and garbage collector. Still, the JVM is around 6 times slower than LuaJIT or C++ for this example. Note: this cannot be extrapolated to other code, of course. And it does **not** lead to a generalizable statement about the relative performance of Lua, LuaJIT, Java or C++. [Yes, a runtime of 0.2 seconds is too low for precise results. But for this simple example, scaling the loop iterations up yields consistent results. The code generated by LuaJIT and C++ *really* has the same performance.] Implementation -------------- Different parts of the JIT-compiler are involved in handling allocation sinking and store sinking. The following subsections describe the required additions and changes to the implementation. SINK Optimization Pass ^^^^^^^^^^^^^^^^^^^^^^ The SINK optimization pass is only invoked, if: #. The SINK optimization and the FOLD, CSE, DCE and FWD optimizations are turned on. This is the case for ``-O3``, which is the default. #. The IR of the current trace holds at least one ``TNEW``, ``TDUP``, ``CNEW`` or ``CNEWI`` instruction. This pass traverses the IR of the current trace in two phases, very much like a classic mark & sweep garbage collection algorithm: * The mark phase marks all allocations that cannot be sunk. * The sweep phase tags the unmarked allocations and the related stores as sunk. Mark Phase """""""""" The mark phase is a single-pass backward propagation algorithm. Initially, the marks of all IR instructions are clear. Then the roots are marked (most of this happens on-the-fly) and the marks are propagated backwards. The following roots are considered: #. All instructions referenced by the last snapshot of a trace, if the trace links to another trace. #. All instructions referenced by ``PHI`` instructions that do not reference an allocation instruction or that reference two different opcodes. #. The references for any remaining load or ``CALLL`` instructions. #. The arguments of any IR ``CALLS`` or ``CALLXS`` instructions. #. The stored value of all store instructions and ``CNEWI``. #. The references for any ineligible store instructions. A store instruction is only eligible, if all of these conditions are true: #. The store reference is a single instruction. Plus a fetch of the base pointer for some reference types. #. The store reference has a constant key, index or offset. #. The store reference points to an allocation. #. If the allocation is a PHI value, then the stored value must either be a PHI value itself (or an integer-to-number conversion of it) or a loop-invariant value. Finally, the PHI references are iteratively remarked. If the left side and the right side have different marks or a different number of PHI values stored, then both sides are marked. This is repeated until the marks converge. After this pass is complete, the allocations that are unmarked are considered safe to be sunk. Sweep Phase """"""""""" The sweep phase is a single-pass through the IR instructions: #. A ``PHI`` instruction is tagged as sunk if it refers to unmarked allocations (it's enough to check one side). #. A store or a ``NEWREF`` is tagged as sunk if the corresponding allocation is unmarked. #. An allocation is tagged as sunk if it's unmarked. All marks are cleared, too. Tagging works as follows: * The IR is tagged to contain valid register/spill slot assignments. * All untagged instructions are set to REGSP_INIT (no register or spill slot allocated). * Sunk allocations, ``NEWREF`` and stores have the register field set to RID_SINK (treated like no register has been allocated to it). * Sunk allocations and ``NEWREF`` have a zero spill slot field. * Sunk stores have a spill slot field that holds the delta between the store and the corresponding allocation (or 255 if the delta is too large). This can be used as a quick test when traversing the IR in search of related stores (see below). Assembler Backend ^^^^^^^^^^^^^^^^^ The assembler backend translates the IR to machine code. It needs to know about sunk instructions to avoid generating machine code for them. Snapshot Allocations """""""""""""""""""" The first encountered guard that needs to add an exit for a snapshot calls a special routine that prepares the current snapshot. Every instruction producing an escaping value must have a register or spill slot allocated to it if it doesn't have one already. This ensures that all values escaping through the snapshot are live at all exits. Processing is backwards, so any value live at the first encountered guard (highest IR instruction) is live at all other guards, up to the snapshot. The snapshot preparation checks for sunk allocations and allocates the stored values instead. This involves searching for all stores related to an allocation. Only the store instructions between the allocation reference and the snapshot reference need to be considered. Once this is done, the register for an allocation is changed from RID_SINK to RID_SUNK. This prevents extra invocations in case the allocation escapes to more than one stack slot or through another snapshot further up. Additional logic tries to allocate a value for the input of an integer-to-number conversion instead of the conversion itself. This effectively sinks conversions and removes them from the fast path. Sunk Instructions """"""""""""""""" A sunk allocation instruction has its register set to RID_SINK or RID_SUNK and a zero (unused) spill slot. The result is thus considered unused by on-the-fly dead-code elimination (DCE). No code is emitted for a sunk allocation. Unused, sunk conversions are eliminated by DCE and no code is emitted for them. Store instructions and the ``NEWREF`` instruction are not eligible for DCE, so they need an explicit check for RID_SINK in their backend handler to not emit any code. Sunk ``PHI`` instructions need the same check to avoid allocating a register or spill slot for the referenced allocations. Snapshot Handling ^^^^^^^^^^^^^^^^^ Snapshots hold the execution state from the view of the bytecode interpreter at selected points in a trace. Sunk allocations and sunk conversions are not actually executed on-trace and thus do not generate a value. However, the stored values (or input values for conversions) are computed and can be used to restore or replay a snapshot. Snapshot Restore """""""""""""""" A taken side exit that has no attached side trace (yet) needs to restore the Lua stack to a sane state. The snapshot corresponding to the exit has a list of Lua stack slots to be restored and their IR references. These can be used together with the register and spill slot assignments in the IR to restore the Lua stack from the exit state of the trace, which holds the current register and spill slot values. Sunk allocations (that escape through a snapshot) have their register set to RID_SUNK. The allocation needs to be 'unsunk': the object needs to be allocated and the related stores need to be performed, as if the object had been allocated on-trace. The snapshot allocation in the assembler backend makes sure that all sunk stores have live values at the exit. The actual values are restored via their IR reference and the exit state, too. Keys of sunk stores are always constant, so they can be reconstructed solely from the IR. Care needs to be taken to de-duplicate allocation references. A sunk allocation may escape to more than one stack slot and all of them must refer to the same object. Snapshot Replay """"""""""""""" The IR for a side trace needs to start with the instructions that link to the parent values included in the snapshot. The ``SLOAD`` instruction with the flag IRSLOAD_PARENT provides parent links for values that correspond to a Lua stack slot. Sunk allocations, sunk stores and sunk conversions need to be replayed instead. Sunk stores and conversions may reference values from the parent that do not have a corresponding Lua stack slot. The ``PVAL`` instruction provides the necessary parent links. All parent links must be at the start of the trace, since their registers and stack slots must be coalesced with the parent trace atomically. Also, a ``PVAL`` must not be used if there's a parent ``SLOAD`` for the same reference. This causes some ordering constraints, which mandates a multi-pass algorithm: #. Emit all parent ``SLOAD`` instructions for non-sunk slots from the snapshot. #. Emit ``PVAL`` instructions for all dependent values from sunk allocations or conversions, except if there's a parent ``SLOAD`` instruction for the same reference. #. Replay the sunk instructions. Allocations need to replay the key references for the sunk stores and the sunk stores themselves, too. [Note: this sounds more expensive, than it is. This algorithm iterates a limited number of times -- usually less than five. Also, passes two and three are only performed when needed.]