pdf-parse-new
Version:
Pure javascript cross-platform module to extract text from PDFs with AI-powered optimization and multi-core processing.
1,083 lines (1,073 loc) • 83.2 kB
Plain Text
Trace-based Just-in-Time Type Specialization for Dynamic
Languages
Andreas Gal
∗+
, Brendan Eich
∗
, Mike Shaver
∗
, David Anderson
∗
, David Mandelin
∗
,
Mohammad R. Haghighat
$
, Blake Kaplan
∗
, Graydon Hoare
∗
, Boris Zbarsky
∗
, Jason Orendorff
∗
,
Jesse Ruderman
∗
, Edwin Smith
#
, Rick Reitmaier
#
, Michael Bebenita
+
, Mason Chang
+#
, Michael Franz
+
Mozilla Corporation
∗
{gal,brendan,shaver,danderson,dmandelin,mrbkap,graydon,bz,jorendorff,jruderman}@mozilla.com
Adobe Corporation
#
{edwsmith,rreitmai}@adobe.com
Intel Corporation
$
{mohammad.r.haghighat}@intel.com
University of California, Irvine
+
{mbebenit,changm,franz}@uci.edu
Abstract
Dynamic languages such as JavaScript are more difficult to com-
pile than statically typed ones. Since no concrete type information
is available, traditional compilers need to emit generic code that can
handle all possible type combinations at runtime. We present an al-
ternative compilation technique for dynamically-typed languages
that identifies frequently executed loop traces at run-time and then
generates machine code on the fly that is specialized for the ac-
tual dynamic types occurring on each path through the loop. Our
method provides cheap inter-procedural type specialization, and an
elegant and efficient way of incrementally compiling lazily discov-
ered alternative paths through nested loops. We have implemented
a dynamic compiler for JavaScript based on our technique and we
have measured speedups of 10x and more for certain benchmark
programs.
Categories and Subject Descriptors D.3.4 [Programming Lan-
guages]: Processors — Incremental compilers, code generation.
General Terms Design, Experimentation, Measurement, Perfor-
mance.
Keywords JavaScript, just-in-time compilation, trace trees.
1. Introduction
Dynamic languages such as JavaScript, Python, and Ruby, are pop-
ular since they are expressive, accessible to non-experts, and make
deployment as easy as distributing a source file. They are used for
small scripts as well as for complex applications. JavaScript, for
example, is the de facto standard for client-side web programming
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
PLDI’09, June 15–20, 2009, Dublin, Ireland.
Copyright c© 2009 ACM 978-1-60558-392-1/09/06. . . $5.00
and is used for the application logic of browser-based productivity
applications such as Google Mail, Google Docs and Zimbra Col-
laboration Suite. In this domain, in order to provide a fluid user
experience and enable a new generation of applications, virtual ma-
chines must provide a low startup time and high performance.
Compilers for statically typed languages rely on type informa-
tion to generate efficient machine code. In a dynamically typed pro-
gramming language such as JavaScript, the types of expressions
may vary at runtime. This means that the compiler can no longer
easily transform operations into machine instructions that operate
on one specific type. Without exact type information, the compiler
must emit slower generalized machine code that can deal with all
potential type combinations. While compile-time static type infer-
ence might be able to gather type information to generate opti-
mized machine code, traditional static analysis is very expensive
and hence not well suited for the highly interactive environment of
a web browser.
We present a trace-based compilation technique for dynamic
languages that reconciles speed of compilation with excellent per-
formance of the generated machine code. Our system uses a mixed-
mode execution approach: the system starts running JavaScript in a
fast-starting bytecode interpreter. As the program runs, the system
identifies hot (frequently executed) bytecode sequences, records
them, and compiles them to fast native code. We call such a se-
quence of instructions a trace.
Unlike method-based dynamic compilers, our dynamic com-
piler operates at the granularity of individual loops. This design
choice is based on the expectation that programs spend most of
their time in hot loops. Even in dynamically typed languages, we
expect hot loops to be mostly type-stable, meaning that the types of
values are invariant. (12) For example, we would expect loop coun-
ters that start as integers to remain integers for all iterations. When
both of these expectations hold, a trace-based compiler can cover
the program execution with a small number of type-specialized, ef-
ficiently compiled traces.
Each compiled trace covers one path through the program with
one mapping of values to types. When the VM executes a compiled
trace, it cannot guarantee that the same path will be followed
or that the same types will occur in subsequent loop iterations.
Hence, recording and compiling a trace speculates that the path and
typing will be exactly as they were during recording for subsequent
iterations of the loop.
Every compiled trace contains all the guards (checks) required
to validate the speculation. If one of the guards fails (if control
flow is different, or a value of a different type is generated), the
trace exits. If an exit becomes hot, the VM can record a branch
trace starting at the exit to cover the new path. In this way, the VM
records a trace tree covering all the hot paths through the loop.
Nested loops can be difficult to optimize for tracing VMs. In
a na¨ıve implementation, inner loops would become hot first, and
the VM would start tracing there. When the inner loop exits, the
VM would detect that a different branch was taken. The VM would
try to record a branch trace, and find that the trace reaches not the
inner loop header, but the outer loop header. At this point, the VM
could continue tracing until it reaches the inner loop header again,
thus tracing the outer loop inside a trace tree for the inner loop.
But this requires tracing a copy of the outer loop for every side exit
and type combination in the inner loop. In essence, this is a form
of unintended tail duplication, which can easily overflow the code
cache. Alternatively, the VM could simply stop tracing, and give up
on ever tracing outer loops.
We solve the nested loop problem by recording nested trace
trees. Our system traces the inner loop exactly as the na¨ıve version.
The system stops extending the inner tree when it reaches an outer
loop, but then it starts a new trace at the outer loop header. When
the outer loop reaches the inner loop header, the system tries to call
the trace tree for the inner loop. If the call succeeds, the VM records
the call to the inner tree as part of the outer trace and finishes
the outer trace as normal. In this way, our system can trace any
number of loops nested to any depth without causing excessive tail
duplication.
These techniques allow a VM to dynamically translate a pro-
gram to nested, type-specialized trace trees. Because traces can
cross function call boundaries, our techniques also achieve the ef-
fects of inlining. Because traces have no internal control-flow joins,
they can be optimized in linear time by a simple compiler (10).
Thus, our tracing VM efficiently performs the same kind of op-
timizations that would require interprocedural analysis in a static
optimization setting. This makes tracing an attractive and effective
tool to type specialize even complex function call-rich code.
We implemented these techniques for an existing JavaScript in-
terpreter, SpiderMonkey. We call the resulting tracing VM Trace-
Monkey. TraceMonkey supports all the JavaScript features of Spi-
derMonkey, with a 2x-20x speedup for traceable programs.
This paper makes the following contributions:
•
We explain an algorithm for dynamically forming trace trees to
cover a program, representing nested loops as nested trace trees.
•
We explain how to speculatively generate efficient type-specialized
code for traces from dynamic language programs.
•
We validate our tracing techniques in an implementation based
on the SpiderMonkey JavaScript interpreter, achieving 2x-20x
speedups on many programs.
The remainder of this paper is organized as follows. Section 3 is
a general overview of trace tree based compilation we use to cap-
ture and compile frequently executed code regions. In Section 4
we describe our approach of covering nested loops using a num-
ber of individual trace trees. In Section 5 we describe our trace-
compilation based speculative type specialization approach we use
to generate efficient machine code from recorded bytecode traces.
Our implementation of a dynamic type-specializing compiler for
JavaScript is described in Section 6. Related work is discussed in
Section 8. In Section 7 we evaluate our dynamic compiler based on
1 for (var i = 2; i < 100; ++i) {
2 if (!primes[i])
3 continue;
4 for (var k = i + i; i < 100; k += i)
5 primes[k] = false;
6 }
Figure 1. Sample program: sieve of Eratosthenes. primes is
initialized to an array of 100 false values on entry to this code
snippet.
Interpret
Bytecodes
Monitor
Record
LIR Trace
Execute
Compiled Trace
Enter
Compiled Trace
Compile
LIR Trace
Leave
Compiled Trace
loop
edge
hot
loop/exit
abort
recording
finish at
loop header
cold/blacklisted
loop/exit
compiled trace
ready
loop edge with
same types
side exit to
existing trace
side exit,
no existing trace
Overhead
Interpreting
Native
Symbol Key
Figure 2. State machine describing the major activities of Trace-
Monkey and the conditions that cause transitions to a new activ-
ity. In the dark box, TM executes JS as compiled traces. In the
light gray boxes, TM executes JS in the standard interpreter. White
boxes are overhead. Thus, to maximize performance, we need to
maximize time spent in the darkest box and minimize time spent in
the white boxes. The best case is a loop where the types at the loop
edge are the same as the types on entry–then TM can stay in native
code until the loop is done.
a set of industry benchmarks. The paper ends with conclusions in
Section 9 and an outlook on future work is presented in Section 10.
2. Overview: Example Tracing Run
This section provides an overview of our system by describing
how TraceMonkey executes an example program. The example
program, shown in Figure 1, computes the first 100 prime numbers
with nested loops. The narrative should be read along with Figure 2,
which describes the activities TraceMonkey performs and when it
transitions between the loops.
TraceMonkey always begins executing a program in the byte-
code interpreter. Every loop back edge is a potential trace point.
When the interpreter crosses a loop edge, TraceMonkey invokes
the trace monitor, which may decide to record or execute a native
trace. At the start of execution, there are no compiled traces yet, so
the trace monitor counts the number of times each loop back edge is
executed until a loop becomes hot, currently after 2 crossings. Note
that the way our loops are compiled, the loop edge is crossed before
entering the loop, so the second crossing occurs immediately after
the first iteration.
Here is the sequence of events broken down by outer loop
iteration:
v0 := ld state[748] // load primes from the trace activation record
st sp[0], v0 // store primes to interpreter stack
v1 := ld state[764] // load k from the trace activation record
v2 := i2f(v1) // convert k from int to double
st sp[8], v1 // store k to interpreter stack
st sp[16], 0 // store false to interpreter stack
v3 := ld v0[4] // load class word for primes
v4 := and v3, -4 // mask out object class tag for primes
v5 := eq v4, Array // test whether primes is an array
xf v5 // side exit if v5 is false
v6 := js_Array_set(v0, v2, false) // call function to set array element
v7 := eq v6, 0 // test return value from call
xt v7 // side exit if js_Array_set returns false.
Figure 3. LIR snippet for sample program. This is the LIR recorded for line 5 of the sample program in Figure 1. The LIR encodes
the semantics in SSA form using temporary variables. The LIR also encodes all the stores that the interpreter would do to its data stack.
Sometimes these stores can be optimized away as the stack locations are live only on exits to the interpreter. Finally, the LIR records guards
and side exits to verify the assumptions made in this recording: that primes is an array and that the call to set its element succeeds.
mov edx, ebx(748) // load primes from the trace activation record
mov edi(0), edx // (*) store primes to interpreter stack
mov esi, ebx(764) // load k from the trace activation record
mov edi(8), esi // (*) store k to interpreter stack
mov edi(16), 0 // (*) store false to interpreter stack
mov eax, edx(4) // (*) load object class word for primes
and eax, -4 // (*) mask out object class tag for primes
cmp eax, Array // (*) test whether primes is an array
jne side_exit_1 // (*) side exit if primes is not an array
sub esp, 8 // bump stack for call alignment convention
push false // push last argument for call
push esi // push first argument for call
call js_Array_set // call function to set array element
add esp, 8 // clean up extra stack space
mov ecx, ebx // (*) created by register allocator
test eax, eax // (*) test return value of js_Array_set
je side_exit_2 // (*) side exit if call failed
...
side_exit_1:
mov ecx, ebp(-4) // restore ecx
mov esp, ebp // restore esp
jmp epilog // jump to ret statement
Figure 4. x86 snippet for sample program. This is the x86 code compiled from the LIR snippet in Figure 3. Most LIR instructions compile
to a single x86 instruction. Instructions marked with (*) would be omitted by an idealized compiler that knew that none of the side exits
would ever be taken. The 17 instructions generated by the compiler compare favorably with the 100+ instructions that the interpreter would
execute for the same code snippet, including 4 indirect jumps.
i=2. This is the first iteration of the outer loop. The loop on
lines 4-5 becomes hot on its second iteration, so TraceMonkey en-
ters recording mode on line 4. In recording mode, TraceMonkey
records the code along the trace in a low-level compiler intermedi-
ate representation we call LIR. The LIR trace encodes all the oper-
ations performed and the types of all operands. The LIR trace also
encodes guards, which are checks that verify that the control flow
and types are identical to those observed during trace recording.
Thus, on later executions, if and only if all guards are passed, the
trace has the required program semantics.
TraceMonkey stops recording when execution returns to the
loop header or exits the loop. In this case, execution returns to the
loop header on line 4.
After recording is finished, TraceMonkey compiles the trace to
native code using the recorded type information for optimization.
The result is a native code fragment that can be entered if the
interpreter PC and the types of values match those observed when
trace recording was started. The first trace in our example, T
45
,
covers lines 4 and 5. This trace can be entered if the PC is at line 4,
i and k are integers, and primes is an object. After compiling T
45
,
TraceMonkey returns to the interpreter and loops back to line 1.
i=3. Now the loop header at line 1 has become hot, so Trace-
Monkey starts recording. When recording reaches line 4, Trace-
Monkey observes that it has reached an inner loop header that al-
ready has a compiled trace, so TraceMonkey attempts to nest the
inner loop inside the current trace. The first step is to call the inner
trace as a subroutine. This executes the loop on line 4 to completion
and then returns to the recorder. TraceMonkey verifies that the call
was successful and then records the call to the inner trace as part of
the current trace. Recording continues until execution reaches line
1, and at which point TraceMonkey finishes and compiles a trace
for the outer loop, T
16
.
i=4. On this iteration, TraceMonkey calls T
16
. Because i=4, the
if statement on line 2 is taken. This branch was not taken in the
original trace, so this causes T
16
to fail a guard and take a side exit.
The exit is not yet hot, so TraceMonkey returns to the interpreter,
which executes the continue statement.
i=5. TraceMonkey calls T
16
, which in turn calls the nested trace
T
45
. T
16
loops back to its own header, starting the next iteration
without ever returning to the monitor.
i=6. On this iteration, the side exit on line 2 is taken again. This
time, the side exit becomes hot, so a trace T
23,1
is recorded that
covers line 3 and returns to the loop header. Thus, the end of T
23,1
jumps directly to the start of T
16
. The side exit is patched so that
on future iterations, it jumps directly to T
23,1
.
At this point, TraceMonkey has compiled enough traces to cover
the entire nested loop structure, so the rest of the program runs
entirely as native code.
3. Trace Trees
In this section, we describe traces, trace trees, and how they are
formed at run time. Although our techniques apply to any dynamic
language interpreter, we will describe them assuming a bytecode
interpreter to keep the exposition simple.
3.1 Traces
A trace is simply a program path, which may cross function call
boundaries. TraceMonkey focuses on loop traces, that originate at
a loop edge and represent a single iteration through the associated
loop.
Similar to an extended basic block, a trace is only entered at
the top, but may have many exits. In contrast to an extended basic
block, a trace can contain join nodes. Since a trace always only
follows one single path through the original program, however, join
nodes are not recognizable as such in a trace and have a single
predecessor node like regular nodes.
A typed trace is a trace annotated with a type for every variable
(including temporaries) on the trace. A typed trace also has an entry
type map giving the required types for variables used on the trace
before they are defined. For example, a trace could have a type map
(x: int, b: boolean), meaning that the trace may be entered
only if the value of the variable x is of type int and the value of b
is of type boolean. The entry type map is much like the signature
of a function.
In this paper, we only discuss typed loop traces, and we will
refer to them simply as “traces”. The key property of typed loop
traces is that they can be compiled to efficient machine code using
the same techniques used for typed languages.
In TraceMonkey, traces are recorded in trace-flavored SSA LIR
(low-level intermediate representation). In trace-flavored SSA (or
TSSA), phi nodes appear only at the entry point, which is reached
both on entry and via loop edges. The important LIR primitives
are constant values, memory loads and stores (by address and
offset), integer operators, floating-point operators, function calls,
and conditional exits. Type conversions, such as integer to double,
are represented by function calls. This makes the LIR used by
TraceMonkey independent of the concrete type system and type
conversion rules of the source language. The LIR operations are
generic enough that the backend compiler is language independent.
Figure 3 shows an example LIR trace.
Bytecode interpreters typically represent values in a various
complex data structures (e.g., hash tables) in a boxed format (i.e.,
with attached type tag bits). Since a trace is intended to represent
efficient code that eliminates all that complexity, our traces oper-
ate on unboxed values in simple variables and arrays as much as
possible.
A trace records all its intermediate values in a small activation
record area. To make variable accesses fast on trace, the trace also
imports local and global variables by unboxing them and copying
them to its activation record. Thus, the trace can read and write
these variables with simple loads and stores from a native activation
recording, independently of the boxing mechanism used by the
interpreter. When the trace exits, the VM boxes the values from
this native storage location and copies them back to the interpreter
structures.
For every control-flow branch in the source program, the
recorder generates conditional exit LIR instructions. These instruc-
tions exit from the trace if required control flow is different from
what it was at trace recording, ensuring that the trace instructions
are run only if they are supposed to. We call these instructions
guard instructions.
Most of our traces represent loops and end with the special loop
LIR instruction. This is just an unconditional branch to the top of
the trace. Such traces return only via guards.
Now, we describe the key optimizations that are performed as
part of recording LIR. All of these optimizations reduce complex
dynamic language constructs to simple typed constructs by spe-
cializing for the current trace. Each optimization requires guard in-
structions to verify their assumptions about the state and exit the
trace if necessary.
Type specialization.
All LIR primitives apply to operands of specific types. Thus,
LIR traces are necessarily type-specialized, and a compiler can
easily produce a translation that requires no type dispatches. A
typical bytecode interpreter carries tag bits along with each value,
and to perform any operation, must check the tag bits, dynamically
dispatch, mask out the tag bits to recover the untagged value,
perform the operation, and then reapply tags. LIR omits everything
except the operation itself.
A potential problem is that some operations can produce values
of unpredictable types. For example, reading a property from an
object could yield a value of any type, not necessarily the type
observed during recording. The recorder emits guard instructions
that conditionally exit if the operation yields a value of a different
type from that seen during recording. These guard instructions
guarantee that as long as execution is on trace, the types of values
match those of the typed trace. When the VM observes a side exit
along such a type guard, a new typed trace is recorded originating
at the side exit location, capturing the new type of the operation in
question.
Representation specialization: objects. In JavaScript, name
lookup semantics are complex and potentially expensive because
they include features like object inheritance and eval. To evaluate
an object property read expression like o.x, the interpreter must
search the property map of o and all of its prototypes and parents.
Property maps can be implemented with different data structures
(e.g., per-object hash tables or shared hash tables), so the search
process also must dispatch on the representation of each object
found during search. TraceMonkey can simply observe the result of
the search process and record the simplest possible LIR to access
the property value. For example, the search might finds the value of
o.x in the prototype of o, which uses a shared hash-table represen-
tation that places x in slot 2 of a property vector. Then the recorded
can generate LIR that reads o.x with just two or three loads: one to
get the prototype, possibly one to get the property value vector, and
one more to get slot 2 from the vector. This is a vast simplification
and speedup compared to the original interpreter code. Inheritance
relationships and object representations can change during execu-
tion, so the simplified code requires guard instructions that ensure
the object representation is the same. In TraceMonkey, objects’ rep-
resentations are assigned an integer key called the object shape.
Thus, the guard is a simple equality check on the object shape.
Representation specialization: numbers. JavaScript has no
integer type, only a Number type that is the set of 64-bit IEEE-
754 floating-pointer numbers (“doubles”). But many JavaScript
operators, in particular array accesses and bitwise operators, really
operate on integers, so they first convert the number to an integer,
and then convert any integer result back to a double.
1
Clearly, a
JavaScript VM that wants to be fast must find a way to operate on
integers directly and avoid these conversions.
In TraceMonkey, we support two representations for numbers:
integers and doubles. The interpreter uses integer representations
as much as it can, switching for results that can only be represented
as doubles. When a trace is started, some values may be imported
and represented as integers. Some operations on integers require
guards. For example, adding two integers can produce a value too
large for the integer representation.
Function inlining. LIR traces can cross function boundaries
in either direction, achieving function inlining. Move instructions
need to be recorded for function entry and exit to copy arguments
in and return values out. These move statements are then optimized
away by the compiler using copy propagation. In order to be able
to return to the interpreter, the trace must also generate LIR to
record that a call frame has been entered and exited. The frame
entry and exit LIR saves just enough information to allow the
intepreter call stack to be restored later and is much simpler than
the interpreter’s standard call code. If the function being entered
is not constant (which in JavaScript includes any call by function
name), the recorder must also emit LIR to guard that the function
is the same.
Guards and side exits. Each optimization described above
requires one or more guards to verify the assumptions made in
doing the optimization. A guard is just a group of LIR instructions
that performs a test and conditional exit. The exit branches to a
side exit, a small off-trace piece of LIR that returns a pointer to
a structure that describes the reason for the exit along with the
interpreter PC at the exit point and any other data needed to restore
the interpreter’s state structures.
Aborts. Some constructs are difficult to record in LIR traces.
For example, eval or calls to external functions can change the
program state in unpredictable ways, making it difficult for the
tracer to know the current type map in order to continue tracing.
A tracing implementation can also have any number of other limi-
tations, e.g.,a small-memory device may limit the length of traces.
When any situation occurs that prevents the implementation from
continuing trace recording, the implementation aborts trace record-
ing and returns to the trace monitor.
3.2 Trace Trees
Especially simple loops, namely those where control flow, value
types, value representations, and inlined functions are all invariant,
can be represented by a single trace. But most loops have at least
some variation, and so the program will take side exits from the
main trace. When a side exit becomes hot, TraceMonkey starts a
new branch trace from that point and patches the side exit to jump
directly to that trace. In this way, a single trace expands on demand
to a single-entry, multiple-exit trace tree.
This section explains how trace trees are formed during execu-
tion. The goal is to form trace trees during execution that cover all
the hot paths of the program.
1
Arrays are actually worse than this: if the index value is a number, it must
be converted from a double to a string for the property access operator, and
then to an integer internally to the array implementation.
Starting a tree. Tree trees always start at loop headers, because
they are a natural place to look for hot paths. In TraceMonkey, loop
headers are easy to detect–the bytecode compiler ensures that a
bytecode is a loop header iff it is the target of a backward branch.
TraceMonkey starts a tree when a given loop header has been exe-
cuted a certain number of times (2 in the current implementation).
Starting a tree just means starting recording a trace for the current
point and type map and marking the trace as the root of a tree. Each
tree is associated with a loop header and type map, so there may be
several trees for a given loop header.
Closing the loop. Trace recording can end in several ways.
Ideally, the trace reaches the loop header where it started with
the same type map as on entry. This is called a type-stable loop
iteration. In this case, the end of the trace can jump right to the
beginning, as all the value representations are exactly as needed to
enter the trace. The jump can even skip the usual code that would
copy out the state at the end of the trace and copy it back in to the
trace activation record to enter a trace.
In certain cases the trace might reach the loop header with a
different type map. This scenario is sometime observed for the first
iteration of a loop. Some variables inside the loop might initially be
undefined, before they are set to a concrete type during the first loop
iteration. When recording such an iteration, the recorder cannot
link the trace back to its own loop header since it is type-unstable.
Instead, the iteration is terminated with a side exit that will always
fail and return to the interpreter. At the same time a new trace is
recorded with the new type map. Every time an additional type-
unstable trace is added to a region, its exit type map is compared to
the entry map of all existing traces in case they complement each
other. With this approach we are able to cover type-unstable loop
iterations as long they eventually form a stable equilibrium.
Finally, the trace might exit the loop before reaching the loop
header, for example because execution reaches a break or return
statement. In this case, the VM simply ends the trace with an exit
to the trace monitor.
As mentioned previously, we may speculatively chose to rep-
resent certain Number-typed values as integers on trace. We do so
when we observe that Number-typed variables contain an integer
value at trace entry. If during trace recording the variable is unex-
pectedly assigned a non-integer value, we have to widen the type
of the variable to a double. As a result, the recorded trace becomes
inherently type-unstable since it starts with an integer value but
ends with a double value. This represents a mis-speculation, since
at trace entry we specialized the Number-typed value to an integer,
assuming that at the loop edge we would again find an integer value
in the variable, allowing us to close the loop. To avoid future spec-
ulative failures involving this variable, and to obtain a type-stable
trace we note the fact that the variable in question as been observed
to sometimes hold non-integer values in an advisory data structure
which we call the “oracle”.
When compiling loops, we consult the oracle before specializ-
ing values to integers. Speculation towards integers is performed
only if no adverse information is known to the oracle about that
particular variable. Whenever we accidentally compile a loop that
is type-unstable due to mis-speculation of a Number-typed vari-
able, we immediately trigger the recording of a new trace, which
based on the now updated oracle information will start with a dou-
ble value and thus become type stable.
Extending a tree. Side exits lead to different paths through
the loop, or paths with different types or representations. Thus, to
completely cover the loop, the VM must record traces starting at all
side exits. These traces are recorded much like root traces: there is
a counter for each side exit, and when the counter reaches a hotness
threshold, recording starts. Recording stops exactly as for the root
trace, using the loop header of the root trace as the target to reach.
Our implementation does not extend at all side exits. It extends
only if the side exit is for a control-flow branch, and only if the side
exit does not leave the loop. In particular we do not want to extend
a trace tree along a path that leads to an outer loop, because we
want to cover such paths in an outer tree through tree nesting.
3.3 Blacklisting
Sometimes, a program follows a path that cannot be compiled
into a trace, usually because of limitations in the implementation.
TraceMonkey does not currently support recording throwing and
catching of arbitrary exceptions. This design trade off was chosen,
because exceptions are usually rare in JavaScript. However, if a
program opts to use exceptions intensively, we would suddenly
incur a punishing runtime overhead if we repeatedly try to record
a trace for this path and repeatedly fail to do so, since we abort
tracing every time we observe an exception being thrown.
As a result, if a hot loop contains traces that always fail, the VM
could potentially run much more slowly than the base interpreter:
the VM repeatedly spends time trying to record traces, but is never
able to run any. To avoid this problem, whenever the VM is about
to start tracing, it must try to predict whether it will finish the trace.
Our prediction algorithm is based on blacklisting traces that
have been tried and failed. When the VM fails to finish a trace start-
ing at a given point, the VM records that a failure has occurred. The
VM also sets a counter so that it will not try to record a trace starting
at that point until it is passed a few more times (32 in our imple-
mentation). This backoff counter gives temporary conditions that
prevent tracing a chance to end. For example, a loop may behave
differently during startup than during its steady-state execution. Af-
ter a given number of failures (2 in our implementation), the VM
marks the fragment as blacklisted, which means the VM will never
again start recording at that point.
After implementing this basic strategy, we observed that for
small loops that get blacklisted, the system can spend a noticeable
amount of time just finding the loop fragment and determining that
it has been blacklisted. We now avoid that problem by patching the
bytecode. We define an extra no-op bytecode that indicates a loop
header. The VM calls into the trace monitor every time the inter-
preter executes a loop header no-op. To blacklist a fragment, we
simply replace the loop header no-op with a regular no-op. Thus,
the interpreter will never again even call into the trace monitor.
There is a related problem we have not yet solved, which occurs
when a loop meets all of these conditions:
•
The VM can form at least one root trace for the loop.
•
There is at least one hot side exit for which the VM cannot
complete a trace.
•
The loop body is short.
In this case, the VM will repeatedly pass the loop header, search
for a trace, find it, execute it, and fall back to the interpreter.
With a short loop body, the overhead of finding and calling the
trace is high, and causes performance to be even slower than the
basic interpreter. So far, in this situation we have improved the
implementation so that the VM can complete the branch trace.
But it is hard to guarantee that this situation will never happen.
As future work, this situation could be avoided by detecting and
blacklisting loops for which the average trace call executes few
bytecodes before returning to the interpreter.
4. Nested Trace Tree Formation
Figure 7 shows basic trace tree compilation (11) applied to a nested
loop where the inner loop contains two paths. Usually, the inner
loop (with header at i
2
) becomes hot first, and a trace tree is rooted
at that point. For example, the first recorded trace may be a cycle
T
Trunk Trace
Tree Anchor
Trace Anchor
Branch Trace
Guard
Side Exit
Figure 5. A tree with two traces, a trunk trace and one branch
trace. The trunk trace contains a guard to which a branch trace was
attached. The branch trace contain a guard that may fail and trigger
a side exit. Both the trunk and the branch trace loop back to the tree
anchor, which is the beginning of the trace tree.
Trace 2
Trace 1 Trace 2Trace 1
Closed
Linked Linked Linked
Number
Number
String
String
String
String
Boolean
Trace 2Trace 1 Trace 3
Linked
Linked Linked
Closed
Number
Number Number
Boolean Number
Boolean Number
Boolean
(a)
(b)
(c)
Figure 6. We handle type-unstable loops by allowing traces to
compile that cannot loop back to themselves due to a type mis-
match. As such traces accumulate, we attempt to connect their loop
edges to form groups of trace trees that can execute without having
to side-exit to the interpreter to cover odd type cases. This is par-
ticularly important for nested trace trees where an outer tree tries to
call an inner tree (or in this case a forest of inner trees), since inner
loops frequently have initially undefined values which change type
to a concrete value after the first iteration.
through the inner loop, {i
2
, i
3
, i
5
, α}. The α symbol is used to
indicate that the trace loops back the tree anchor.
When execution leaves the inner loop, the basic design has two
choices. First, the system can stop tracing and give up on compiling
the outer loop, clearly an undesirable solution. The other choice is
to continue tracing, compiling traces for the outer loop inside the
inner loop’s trace tree.
For example, the program might exit at i
5
and record a branch
trace that incorporates the outer loop: {i
5
, i
7
, i
1
, i
6
, i
7
, i
1
, α}.
Later, the program might take the other branch at i
2
and then
exit, recording another branch trace incorporating the outer loop:
{i
2
, i
4
, i
5
, i
7
, i
1
, i
6
, i
7
, i
1
, α}. Thus, the outer loop is recorded and
compiled twice, and both copies must be retained in the trace cache.
i2
i3 i4
i5
i1
i6
i7
t1
t2
Tree Call
Outer Tree
Nested Tree
Exit Guard
(a)
(b)
Figure 7. Control flow graph of a nested loop with an if statement
inside the inner most loop (a). An inner tree captures the inner
loop, and is nested inside an outer tree which “calls” the inner tree.
The inner tree returns to the outer tree once it exits along its loop
condition guard (b).
In general, if loops are nested to depth k, and each loop has n paths
(on geometric average), this na¨ıve strategy yields O(n
k
) traces,
which can easily fill the trace cache.
In order to execute programs with nested loops efficiently, a
tracing system needs a technique for covering the nested loops with
native code without exponential trace duplication.
4.1 Nesting Algorithm
The key insight is that if each loop is represented by its own trace
tree, the code for each loop can be contained only in its own tree,
and outer loop paths will not be duplicated. Another key fact is that
we are not tracing arbitrary bytecodes that might have irreduceable
control flow graphs, but rather bytecodes produced by a compiler
for a language with structured control flow. Thus, given two loop
edges, the system can easily determine whether they are nested
and which is the inner loop. Using this knowledge, the system can
compile inner and outer loops separately, and make the outer loop’s
traces call the inner loop’s trace tree.
The algorithm for building nested trace trees is as follows. We
start tracing at loop headers exactly as in the basic tracing system.
When we exit a loop (detected by comparing the interpreter PC
with the range given by the loop edge), we stop the trace. The
key step of the algorithm occurs when we are recording a trace
for loop L
R
(R for loop being recorded) and we reach the header
of a different loop L
O
(O for other loop). Note that L
O
must be an
inner loop of L
R
because we stop the trace when we exit a loop.
•
If L
O
has a type-matching compiled trace tree, we call L
O
as
a nested trace tree. If the call succeeds, then we record the call
in the trace for L
R
. On future executions, the trace for L
R
will
call the inner trace directly.
•
If L
O
does not have a type-matching compiled trace tree yet,
we have to obtain it before we are able to proceed. In order
to do this, we simply abort recording the first trace. The trace
monitor will see the inner loop header, and will immediately
start recording the inner loop.
2
If all the loops in a nest are type-stable, then loop nesting creates
no duplication. Otherwise, if loops are nested to a depth k, and each
2
Instead of aborting the outer recording, we could principally merely sus-
pend the recording, but that would require the implementation to be able
to record several traces simultaneously, complicating the implementation,
while saving only a few iterations in the interpreter.
i2
i3
i1
i6
i4
i5
t2
t1
t4
Exit Guard
Nested Tree
Figure 8. Control flow graph of a loop with two nested loops (left)
and its nested trace tree configuration (right). The outer tree calls
the two inner nested trace trees and places guards at their side exit
locations.
loop is entered with m different type maps (on geometric average),
then we compile O(m
k
) copies of the innermost loop. As long as
m is close to 1, the resulting trace trees will be tractable.
An important detail is that the call to the inner trace tree must act
like a function call site: it must return to the same point every time.
The goal of nesting is to make inner and outer loops independent;
thus when the inner tree is called, it must exit to the same point
in the outer tree every time with the same type map. Because we
cannot actually guarantee this property, we must guard on it after
the call, and side exit if the property does not hold. A common
reason for the inner tree not to return to the same point would
be if the inner tree took a new side exit for which it had never
compiled a trace. At this point, the interpreter PC is in the inner
tree, so we cannot continue recording or executing the outer tree.
If this happens during recording, we abort the outer trace, to give
the inner tree a chance to finish growing. A future execution of the
outer tree would then be able to properly finish and record a call to
the inner tree. If an inner tree side exit happens during execution of
a compiled trace for the outer tree, we simply exit the outer trace
and start recording a new branch in the inner tree.
4.2 Blacklisting with Nesting
The blacklisting algorithm needs modification to work well with
nesting. The problem is that outer loop traces often abort during
startup (because the inner tree is not available or takes a side exit),
which would lead to their being quickly blacklisted by the basic
algorithm.
The key observation is that when an outer trace aborts because
the inner tree is not ready, this is probably a temporary condition.
Thus, we should not count such aborts toward blacklisting as long
as we are able to build up more traces for the inner tree.
In our implementation, when an outer tree aborts on the inner
tree, we increment the outer tree’s blacklist counter as usual and
back off on compiling it. When the inner tree finishes a trace, we
decrement the blacklist counter on the outer loop, “forgiving” the
outer loop for aborting previously. We also undo the backoff so that
the outer tree can start immediately trying to compile the next time
we reach it.
5. Trace Tree Optimization
This section explains how a recorded trace is translated to an
optimized machine code trace. The trace compilation subsystem,
NANOJIT, is separate from the VM and can be used for other
applications.
5.1 Optimizations
Because traces are in SSA form and have no join points or φ-
nodes, certain optimizations are easy to implement. In order to
get good startup performance, the optimizations must run quickly,
so we chose a small set of optimizations. We implemented the
optimizations as pipelined filters so that they can be turned on and
off independently, and yet all run in just two loop passes over the
trace: one forward and one backward.
Every time the trace recorder emits a LIR instruction, the in-
struction is immediately passed to the first filter in the forward
pipeline. Thus, forward filter optimizations are performed as the
trace is recorded. Each filter may pass each instruction to the next
filter unchanged, write a different instruction to the next filter, or
write no instruction at all. For example, the constant folding filter
can replace a multiply instruction like v
13
:= mul3, 1000 with a
constant instruction v
13
= 3000.
We currently apply four forward filters:
•
On ISAs without floating-point instructions, a soft-float filter
converts floating-point LIR instructions to sequences of integer
instructions.
•
CSE (constant subexpression elimination),
•
expression simplification, including constant folding and a few
algebraic identities (e.g., a − a = 0), and
•
source language semantic-specific expression simplification,
primarily algebraic identities that allow DOUBLE to be replaced
with INT. For example, LIR that converts an INT to a DOUBLE
and then back again would be removed by this filter.
When trace recording is completed, nanojit runs the backward
optimization filters. These are used for optimizations that require
backward program analysis. When running the backward filters,
nanojit reads one LIR instruction at a time, and the reads are passed
through the pipeline.
We currently apply three backward filters:
•
Dead data-stack store elimination. The LIR trace encodes many
stores to locations in the interpreter stack. But these values are
never read back before exiting the trace (by the interpreter or
another trace). Thus, stores to the stack that are overwritten
before the next exit are dead. Stores to locations that are off
the top of the interpreter stack at future exits are also dead.
•
Dead call-stack store elimination. This is the same optimization
as above, except applied to the interpreter’s call stack used for
function call inlining.
•
Dead code elimination. This eliminates any operation that
stores to a value that is never used.
After a LIR instruction is successfully read (“pulled”) from
the backward filter pipeline, nanojit’s code generator emits native
machine instruction(s) for it.
5.2 Register Allocation
We use a simple greedy register allocator that makes a single
backward pass over the trace (it is integrated with the code gen-
erator). By the time the allocator has reached an instruction like
v
3
= add v
1
, v
2
, it has already assigned a register to v
3
. If v
1
and
v
2
have not yet been assigned registers, the allocator assigns a free
register to each. If there are no free registers, a value is selected for
spilling. We use a class heuristic that selects the “oldest” register-
carried value (6).
The heuristic considers the set R of values v in registers imme-
diately after the current instruction for spilling. Let v
m
be the last
instruction before the current where each v is referred to. Then the
Tag
JS Type Description
xx1 number 31-bit integer representation
000 object pointer to JSObject handle
010 number pointer to double handle
100 string pointer to JSString handle
110
boolean enumeration for null, undefined, true, false
null, or
undefined
Figure 9. Tagged values in the SpiderMonkey JS interpreter.
Testing tags, unboxing (extracting the untagged value) and boxing
(creating tagged values) are significant costs. Avoiding these costs
is a key benefit of tracing.
heuristic selects v with minimum v
m
. The motivation is that this
frees up a register for as long as possible given a single spill.
If we need to spill a value v
s
at this point, we generate the
restore code just after the code for the current instruction. The
corresponding spill code is generated just after the last point where
v
s
was used. The register that was assigned to v
s
is marked free for
the preceding code, because that register can now be used freely
without affecting the following code
6. Implementation
To demonstrate the effectiveness of our approach, we have im-
plemented a trace-based dynamic compiler for the SpiderMonkey
JavaScript Virtual Machine (4). SpiderMonkey is the JavaScript
VM embedded in Mozilla’s Firefox open-source web browser (2),
which is used by more than 200 million users world-wide. The core
of SpiderMonkey is a bytecode interpreter implemented in C++.
In SpiderMonkey, all JavaScript values are represented by the
type jsval. A jsval is machine word in which up to the 3 of the
least significant bits are a type tag, and the remaining bits are data.
See Figure 6 for details. All pointers contained in jsvals point to
GC-controlled blocks aligned on 8-byte boundaries.
JavaScript object values are mappings of string-valued property
names to arbitrary values. They are represented in one of two ways
in SpiderMonkey. Most objects are represented by a shared struc-
tural description, called the object shape, that maps property names
to array indexes using a hash table. The object stores a pointer to
the shape and the array of its own property values. Objects with
large, unique sets of property names store their properties directly
in a hash table.
The garbage collector is an exact, non-generational, stop-the-
world mark-and-sweep collector.
In the rest of this section we discuss key areas of the TraceMon-
key implementation.
6.1 Calling Compiled Traces
Compiled traces are stored in a trace cache, indexed by intepreter
PC and type map. Traces are compiled so that they may be
called as functions using standard native calling conventions (e.g.,
FASTCALL on x86).
The interpreter must hit a loop edge and enter the monitor in
order to call a native trace for the first time. The monitor computes
the current type map, checks the trace cache for a trace for the
current PC and type map, and if it finds one, executes the trace.
To execute a trace, the monitor must build a trace activation
record containing imported local and global variables, temporary
stack space, and space for arguments to native calls. The local and
global values are then copied from the interpreter state to the trace
activation record. Then, the trace is called like a normal C function
pointer.
When a trace call returns, the monitor restores the interpreter
state. First, the monitor checks the reason for the trace exit and
applies blacklisting i