Tutorial: How Folding Engine Works

When folding happens

Every time you invoke emitir, you actually pass IR to folding engine:

/* Pass IR on to next optimization in chain (FOLD). */
#define emitir(ot, a, b)        (lj_ir_set(J, (ot), (a), (b)), lj_opt_fold(J))

IR emission, in turn, takes place on trace recording, on loop optimization phase and on snapshot state replay during side trace setup.

Folding engine is the first stage in a group of optimizations:

  • FOLD – folding rule based engine.
  • CSE – common subexpression elimination.
  • FWD – load/store forwarding.
  • DSE – dead store elimination.

And all these optimizations will be followed by an actual IR emission phase (in case it was not optimized away), which is simply placing IR in trace IR buffer while maintaining ‘same opcode’ chains of already emitted IRs. One can entirely bypass the folding engine and optimizations mentioned above via emitir_raw macro.

So when you deal with folding engine you should expect an type of IR as an input. As a consequence, for example, (remember that loads/stores optimizations cannot be performed without proper alias analysis) there must be rules for loads and stores to skip folding and CSE for these IRs and dispatch them to FWD or DSE instead.

How folding works

emitir macro assigns passed IR to J->fold.ins and calls lj_opt_fold(J) , which does the following:

  • Check compiler flags for early exit, which may be:

    • CSE pass for normal IRs in case FOLD pass is disabled
    • Emitter (i.e. emitir becomes effectively equal to emitir_raw ) for loads (except SLOAD, in case not all of FOLD, FWD and CSE are enabled) and stores (in case not both FOLD and DSE are enabled)
  • Lookup for folding rule:

    • Construct key, which is

      |folded ir|left ir|right ir opcode|
      | opcode  | opcode| or literal    |

      At this stage J->fold.left and J->fold.right are also filled with folded IR operands to save typing while implementing fold rules.

    • Apply any-masking. Folding engine tries to lookup rules not only for exact combination of opcode and left/right operands, but also successively masks operands with ‘any’ wildcard, from most specific to least specific:

      • opcode left right
      • opcode any right
      • opcode left any
      • opcode any any
    • Lookup a folding rule by masked key in semi-perfect hash table (which is implemented as an array generated by buildvm that searches folding rules by LJFOLD defines, look into buildvm_fold.c for details).

    • Run rule on successful lookup.

  • Handle folding rule result, which may be:

    • NEXTFOLD. Means that no folding was applied, because of additional test inside the folding rule. Matching continues with the next iteration of any-masking. If no rule was found for the least specific variant of instruction (with any-any operands), is passed to CSE.
    • RETRYFOLD. Means instruction was modified in-place. Folding will be retried as if this instruction had just been received.
    • FAILFOLD, DROPFOLD. Applied to instructions which have no result, for example guarded assertions. FAILFOLD means that guard will always fail and trace should be aborted. DROPFOLD means that guard is always successful.
    • IR reference (IRRef). This reference may be a result of actual emission, reference to already emitted IR, result of CSE, etc.

How folding rule looks like

Here are examples of folding rules:


    return LEFTFOLD;  /* f(g(x)) ==> g(x) */

Defines used for description of folding rules:

  • LJFOLD/LJFOLDX/LJFOLDF. They are used by buildvm to generate folding engine hash table. LJFOLD(opcode left right) declares folding rule for opcode with left and right operands, folding function is specified below with either already defined function (LJFOLDX(funcname)) or function defined in-place (LJFOLDF(funcname)). For the first example above, folding rule applies CSE to comparisons of FLOAD results with IR_KNULL. And the last example simplifies chains of successive IR_ABS opcodes (KNUM here matches ABS auxiliary implementation specific constant).
    Means result is a reference provided by left/right operand.
    Means result is raw emission or instruction is passed to CSE.
  • INTFOLD(i).
    Means that i will be interned and corresponding reference will be returned as folding result.
    Described above.
  • CONDFOLD(cond).
    Is equal to “cond ? DROPFOLD : FAILFOLD"
  • Defines to save typing:
    • #define fins (&J->fold.ins)
    • #define fleft (&J->fold.left)
    • #define fright (&J->fold.right)
    • #define knumleft (ir_knum(fleft)->n)
    • #define knumright (ir_knum(fright)→n)

Please take the following into account while creating/modifying folding rules:

  • Fold rules must preserve destination type, i.e. use INTFOLD for KINT result lj_ir_knum for KNUM result, DROPFOLD/FAILFOLD/CONDFOLD for no result and never use lj_ir_knumint.
  • Fold rules should not create new instructions which reference operands across PHIs. E.g. a RETRYFOLD with fins->op1 = fleft->op1 is invalid if the left operand is a PHI. One could workaround the requirement by issuing new PHIs, but this should be considered as counterproductive. The solution is to abort folding rule via NEXTFOLD if fleft is PHI, which could be shorthanded by equivalent PHIBARRIER(fleft) define. Even though this requirement is easy to slip while writing folding rules, you will likely be punished with either of consequences of corrupted control flow, for example with test timed out in an infinite loop. Note: returning  existing instructions (e.g. LEFTFOLD) is ok.
  • Fold rules must be monotonic to guarantee termination. That means rules should point towards eliminating instructions or replacing with one or more simpler instructions or moving constants to right operand to enable other folding rules.

List of available folding rules with examples

  • Constant folding for FP numbers

  • Constant folding of conversions

  • Constant folding of pointer arithmetic

  • Constant folding of equality checks

  • Constant folding for 32 bit integers

  • Constant folding for 64 bit integers

    • Computes result for constant operands and returns reference to it instead of actual emission of instruction. uj_math_foldarith and uj_math_foldfpm are used for FP folding.
  • Constant folding for strings

    • Interns strings with constant payload, zero size or result of strcmp on constant strings.
  • Algebraic shortcuts

    • round(round(x)) = round(x)
    • abs(abs(x)) = abs(x)
    • abs(neg(x)) = abs(x)
    • neg(neg(x)) = x; same for bnot and bswap
  • Bit operations simplifications

    • i & 0 = 0; i & -1 = i
    • i | 0 = i; i | -1 = -1
    • i ^ 0 = i; i ^ -1 = ~i
    • shifts with constants and masks simplifications
  • Integer algebraic simplifications

    • i + 0 = i; same for subtraction
    • i * 0 = 0
    • i * 1 = i
    • i * 2 = i + i
    • i * 2^k = i << k
    • (i + j) - i = j and similar
    • (i + j1) - (i + j2) = j1 - j2 and similar
  • FP algebraic simplifications

    • a + (-b) = a - b
    • a - (-b) = a + b
    • (-x) - k = (-k) - x
    • x * 1 = x; x * (-1) = -x; same for division
    • x - 0.0 = x
    • x * 2 = x + x
    • (-x) * (-x) = x * x
    • 2.0 ^ i = ldexp(1.0, tonum(i))
    • x ^ 0 = 1; x ^ 1 = x; x^k = x*x*…*x
  • Conversion simplifications

    • tonum(tonum(x)) = tonum(x)
    • int(int(x)) = int(x)
    • eliminate widening to 64-bits for results of 32-bit operations as they do an implicit zero-extension
    • special CSE rule for conversions
  • Reassociation
    • (i + k1) + k2 = i + (k1 + k2); same for multiplication, bit operations
    • min(min(a,b), a) = min(a,b); same for max
  • Array bounds check elimination

    • ABC(asize, (i + k) + (-k)) → ABC(asize, i) if it already exists
    • ABC(asize, k1), ABC(asize, k2) → ABC(asize, max(k1, k2)), drop second ABC if k2 is lower, otherwise patch first ABC with k2.
    • Eliminates invariant ABC inside loop.
  • Commutativity folding

    • Canonicalizes commutative instructions with lower refs going to the right operand (remember, that constants have lower refs than instructions). It helps to reduce folding rules (constants are always right operands), helps CSE to find more matches, helps assembler to generate better code with constants at the right.
  • Simplification of compound expressions

    • Turns

      string.sub(str, a, b) == kstr


      string.byte(str, a) == string.byte(kstr, 1) && string.byte(str, a + 1) == string.byte(kstr, 2) && etc.
  • Load folding rule

    • Dispatches LOADs to various load forwarding optimizations.
    • CSE for upvalue references.
    • CSE for loads from strings as they are immutable.
    • Probably CSE for loads from immutable tables should also reside here.
  • Write barrier folding rules

    • CSE for barriers between GC steps.
  • Store and allocation folding rules

    • Dispatches STOREs to various DSE optimizations.
    • Rules to emit IRs with side effects and bypass CSE for them.
  • Concatenation folding

    • Handles folding concatenation of constant strings.