Case Study: Writing Benchmarks for LuaVela

Introduction

Suppose you want to micro-benchmark a certain part of the platform: say, you want to know how much a call to tonumber costs. You know that LuaVela can be downloaded and used as a stand-alone application (see Obtaining LuaVela), and you have prepared a simple benchmark like this:

-- file1.lua:
local x
for i = 1, 100000000 do
      x = tonumber("320")
end

-- file2.lua:
local x
for i = 1, 100000000 do
      x = 320
end

Then you run it like this:

$ /usr/bin/time -f%U ujit file1.lua
...
$ /usr/bin/time -f%U ujit file2.lua
...

Will the results be representative? Definitely not. Wondering why? Then read on!

Analyzing the Benchmark

First, note that the loop body in both original benchmarks contains only loop invariants, i.e. expressions that do not depend on the loop variable. This means that the first benchmark can be rewritten as follows, with the loop body hoisted out of the loop:

local x = tonumber("320")
for i = 1, 100000000 do
end

Fortunately, the JIT compiler is smart enough to perform this hoisting, too. Unfortunately for this particular case, the ujit stand-alone application starts with the JIT compiler on. In effect, the two original benchmarks compare the speed of executing the same empty loop, which is definitely not what you wanted. If you dump the progress of the JIT compiler, you will see something like this:

$ ./ujit -p- -e 'local x; for i = 1, 100000000 do x = tonumber("320") end'
---- TRACE 1 start =(command line):1
0006    GGET     5   0      ; "tonumber"
0007    KSTR     6   1      ; "320"
0008    CALL     5   2   2
0000    . FUNCC               ; tonumber
0009    MOV      0   5
0011    FORL     1 => 0006
---- TRACE 1 IR
....              SNAP   #0   [ ---- ]
0001 rbp      int SLOAD  #2    CI
0002 r9       fun SLOAD  #0    R
0003 rsi      tab FLOAD  0002  func.env
0004 r8       int FLOAD  0003  tab.hmask
0005       >  int EQ     0004  +63
0006 rbx      p32 FLOAD  0003  tab.node
0007 rdx   >  p32 HREFK  0006  "tonumber" @13
0008 rax   >  fun HLOAD  0007
0009       >  fun EQ     0008  tonumber
0010 rbp    + int ADD    0001  +1
....              SNAP   #1   [ ---- 320 ]
0011       >  int LE     0010  +100000000
....              SNAP   #2   [ ---- 320 0010 ---- ---- 0010 ]
0012 ------------ LOOP ------------
0013 rbp    + int ADD    0010  +1
....              SNAP   #3   [ ---- 320 ]
0014       >  int LE     0013  +100000000
0015 rbp      int PHI    0010  0013
---- TRACE 1 mcode 155
0bd6ff65  mov r11, 0x7fc692377620
0bd6ff6f  mov dword [r11], 0x1
0bd6ff76  mov rcx, 0x7fc69237be98
0bd6ff80  cvtsd2si ebp, qword [r10+0x10]
0bd6ff86  mov r9, [r10-0x10]
0bd6ff8a  mov rsi, [r9+0x10]
0bd6ff8e  mov r8d, [rsi+0x38]
0bd6ff92  cmp r8d, 0x3f
0bd6ff96  jnz 0xbd60010             ->0
0bd6ff9c  mov rbx, [rsi+0x28]
0bd6ffa0  mov rdi, 0x7fc69237bed0
0bd6ffaa  cmp rdi, [rbx+0x218]
0bd6ffb1  jnz 0xbd60010             ->0
0bd6ffb7  cmp dword [rbx+0x220], 0xfffffffb
0bd6ffbe  jnz 0xbd60010             ->0
0bd6ffc4  lea rdx, [rbx+0x208]
0bd6ffcb  cmp dword [rdx+0x8], 0xfffffff7
0bd6ffcf  jnz 0xbd60010             ->0
0bd6ffd5  mov rax, [rdx]
0bd6ffd8  cmp rax, rcx
0bd6ffdb  jnz 0xbd60010             ->0
0bd6ffe1  add ebp, 0x1
0bd6ffe4  cmp ebp, 0x5f5e100
0bd6ffea  jg 0xbd60014              ->1
-> LOOP:
0bd6fff0  add ebp, 0x1
0bd6fff3  cmp ebp, 0x5f5e100
0bd6fff9  jle 0xbd6fff0             ->LOOP
0bd6fffb  jmp 0xbd6001c             ->3
---- TRACE 1 stop -> loop

Without diving into the details, pay attention to the ------------ LOOP ------------ line: everything above it is, roughly speaking, loop-invariant code. As you can see, everything that resembles a tonumber call is above that line, while only a few IR instructions (the counter increment and the loop-bound check) remain below it. Luckily, there is no need to examine dumps and dive into the gory details of the compiler each time you benchmark something. Just remember this:

Note

When benchmarking any code by wrapping the code into a loop, do one of the following:

  1. Switch the compiler off;
  2. Ensure that the benchmarked code is not loop-invariant, i.e. it must depend on the loop variable.
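
A minimal sketch of option 2, assuming a small hypothetical table of inputs (inputs and sum_converted are illustrative names, not part of LuaVela). Indexing the table by the loop variable gives each iteration a different argument, so the call is no longer loop-invariant. Summing the results is an extra precaution, on the assumption that a value nobody reads could be optimized away.

```lua
-- Hypothetical inputs; any set of distinct numeric strings would do.
local inputs = { "320", "321", "322", "323" }

local function sum_converted(n)
  local count = #inputs
  local acc = 0
  for i = 1, n do
    -- The argument depends on i, so it changes from iteration to iteration.
    acc = acc + tonumber(inputs[i % count + 1])
  end
  return acc
end

print(sum_converted(4))
```

The modulo and the table load add their own cost to every iteration, so the baseline loop should perform the same indexing without the call under test.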

Benchmarking with Compiler Off

Running ujit with the compiler off lets you avoid any invisible side effects on your code: in this case, the interpreter literally executes what you have written. Simply start ujit with -joff, or switch the compiler off directly in the Lua chunk with jit.off().
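
The in-chunk variant might look like the sketch below. It assumes that jit.off() and jit.status() behave as in LuaJIT-compatible runtimes; disable_jit is a hypothetical helper, and the type check is only there so the snippet also runs on an interpreter without a jit library.

```lua
-- Switch the compiler off and report the resulting status.
-- Returns nil when the runtime provides no `jit` library at all.
local function disable_jit()
  if type(jit) ~= "table" then
    return nil
  end
  jit.off()
  -- jit.status() may return extra values; keep only the first one.
  return (jit.status())
end

local status = disable_jit()
assert(status ~= true, "JIT compiler is still on")
```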

With this technique, you gain the following:

  • You can estimate how your hypotheses perform relative to each other.
  • You can estimate how your code will perform if this particular part of the code fails to JIT-compile (say, if it unluckily becomes part of a trace that contains something non-JITtable, like a call to the C API).

With this technique, you obviously lose the following:

  • You’ll never know what performance you can achieve if the benchmarked code JIT-compiles.

Now let’s see how we can fix the benchmark:

assert(jit.status() == false, "This benchmark is designed to run without JIT, please either pass -joff on the command line or call jit.off() in Lua")

local N = 1e8

-- Loop with the call being measured.
local function assign1()
      print("With tonumber")
      local n
      for i = 1, N do
         n = tonumber("320")
      end
      return n
end

-- Baseline: the same loop without the call.
local function assign2()
      print("Without tonumber")
      local n
      for i = 1, N do
         n = 320
      end
      return n
end

-- Pass "tonumber" as the first script argument to run assign1;
-- anything else (or nothing) runs the baseline.
local function benchmark(name)
      if name == "tonumber" then
         assign1()
      else
         assign2()
      end
end

benchmark(arg[1])

And run it:

$ /usr/bin/time -f%U ./ujit -joff tonumber-no-jit.lua
...
$ /usr/bin/time -f%U ./ujit -joff tonumber-no-jit.lua tonumber
...
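
If you would rather measure only the loop and not process start-up, one option is to time inside Lua. The sketch below uses the standard os.clock, which returns an approximation of the CPU time used by the program. time_it and busy_loop are hypothetical helpers, not part of LuaVela, and the iteration count is kept small for illustration.

```lua
-- Call fn with the given arguments; return elapsed CPU seconds and fn's result.
local function time_it(fn, ...)
  local started = os.clock()
  local result = fn(...)
  return os.clock() - started, result
end

-- Stand-in for the code under test.
local function busy_loop(n)
  local x
  for i = 1, n do
    x = tonumber("320")
  end
  return x
end

local seconds, value = time_it(busy_loop, 1e5)
print(string.format("%.3f s, result %d", seconds, value))
```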

Benchmarking with Compiler On

As demonstrated above, running the benchmark with the compiler on immediately exposes your code to various transformations done by the compiler. On the other hand, this is much more fun! Just be careful:

  • No loop-invariants, remember?
  • You may want to read Reduce test cases to gain more inspiration and knowledge about interacting with the compiler.
  • You may ultimately want to learn the dump format, at least to verify that you are benchmarking what you intended to. This can be really tricky. The LuaVela team understands that it would be too cruel to make you do so, and you are always welcome to ask questions if you feel stuck benchmarking with the compiler turned on.

Despite these pitfalls, you definitely gain the following:

  • You can estimate how your hypotheses perform relative to each other with JIT compilation on.
  • You can estimate how your code will perform if this particular part of the code is lucky enough to JIT-compile.

So let’s see one possible variant of the benchmark.

First, let’s generate some data:

$ ujit -e 'print("local t1 = {");
for i = 100, 999 do print("\t\"" .. i .. "\",") end;
print "}";
print "local t2 = {";
for i = 100, 999 do print("\t" .. i .. ",") end;
print "}";
print("return {t1 = t1, t2 = t2}")
' >data.lua
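
The generated data.lua returns a table with two arrays of 900 elements each: t1 holds the numbers 100 to 999 as strings, t2 holds them as numbers. If you prefer to skip the file, a roughly equivalent in-memory sketch could look like this (make_data is a hypothetical helper, not part of LuaVela):

```lua
-- Build the same shape as data.lua: strings in t1, numbers in t2.
local function make_data(first, last)
  local t1, t2 = {}, {}
  for i = first, last do
    t1[#t1 + 1] = tostring(i)
    t2[#t2 + 1] = i
  end
  return { t1 = t1, t2 = t2 }
end

local data = make_data(100, 999)
print(#data.t1, #data.t2)
```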

And the benchmark itself:

assert(jit.status() == true, "This benchmark is designed to run with JIT, please either pass -jon on the command line or call jit.on() in Lua")

local N = 1e8
local data = require("data")
assert(type(data) == "table")
assert(type(data.t1) == "table")
assert(type(data.t2) == "table")

local function assign1(t)
      print("With tonumber")
      local nels = #t
      local n
      for i = 1, N do
         n = tonumber(t[i % nels + 1])
      end
      return n
end

local function assign2(t)
      print("Without tonumber")
      local nels = #t
      local n
      for i = 1, N do
         n = t[i % nels + 1]
      end
      return n
end

local function benchmark(name, t1, t2)
      if name == "tonumber" then
         assign1(t1)
      else
         assign2(t2)
      end
end

benchmark(arg[1], data.t1, data.t2)

And run it:

$ /usr/bin/time -f%U ./ujit -jon tonumber-jit.lua
...
$ /usr/bin/time -f%U ./ujit -jon tonumber-jit.lua tonumber
...
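
To turn the two timings into a per-call estimate, subtract the baseline from the tonumber run and divide by the iteration count. In the sketch below, the timings are made-up placeholders, not measured results, and per_call_ns is a hypothetical helper.

```lua
-- Estimated cost of one call in nanoseconds:
-- (time with the call - baseline time) / number of iterations.
local function per_call_ns(seconds_with, seconds_without, iterations)
  return (seconds_with - seconds_without) / iterations * 1e9
end

-- Placeholder timings for illustration only; substitute your own output.
local estimate = per_call_ns(2.5, 0.5, 1e8)
print(string.format("~%.1f ns per call", estimate))
```

The estimate is only as good as the baseline: both loops should do the same work apart from the call being measured.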

Conclusion

  • Be careful when benchmarking with the compiler turned on: in particular, do not allow hoisting of loop-invariant code to spoil your benchmark and make it totally non-representative.
  • Do two-fold benchmarking:
    • To evaluate worst-case performance, benchmark with JIT compiler off;
    • To evaluate best-case performance, benchmark with JIT compiler on.
  • Do not hesitate to contact the LuaVela team if you are puzzled by any result you observe.