How Effective are Classic Lookup Optimizations for Rails Apps?

We know that Ruby and especially Rails applications can be very dynamic and pretty large. Though, many of the optimizations interpreters and even just-in-time compilers use were invented in the 1980s and 1990s, before Ruby and Rails even existed. So, I was wondering: do these optimizations still have a chance of coping with the millions of lines of Ruby code that large Rails apps from Shopify, Stripe, or GitLab have?

As part of her research, Sophie wrote a paper investigating the behavior of method call sites in detail. She looked at how well optimizations such as lookup caches, target duplicate elimination, and splitting apply to modern Ruby code. I’ll use the work here as a foundation and zoom in on the Rails apps we looked at. For all details including the measurement methodology, I’ll defer to sec. 3 of the paper. It also discusses how Sophie instrumented TruffleRuby and how the data was processed.

BlogRails, ERubiRails, and the Liquid Benchmarks

The benchmarks I am going to be focusing on are called BlogRails, ERubiRails, LiquidCartRender, and LiquidRenderBibs. BlogRails, usually referred to as railsbench, is a small Ruby on Rails application simulating a basic blog, as created by Rails' scaffold generator. The benchmark accesses existing blog posts and creates new ones. ERubiRails is a similarly small Rails app; it renders an ERB template from the Discourse project.

I also included two Liquid template language benchmarks here out of curiosity. LiquidCartRender uses Liquid to render an HTML page for a shopping cart. LiquidRenderBibs renders an HTML page with a list of papers that have a variety of different data bits to be shown (specifically this one here).

Benchmark         Statements  Statement  Functions  Function  Calls      Poly. and    Used Call  Poly. and
                              Coverage              Coverage  (in 1000)  Megamorphic  Sites      Megamorphic
                                                                         Calls                   Call Sites
BlogRails            118,717  48%           37,595  38%          13,863  7.4%            52,361  2.3%
ERubiRails           117,922  45%           37,328  35%          12,309  5.4%            47,794  2.3%
LiquidCartRender      23,562  39%            6,269  30%             236  5.5%             3,581  2.4%
LiquidRenderBibs      23,277  39%            6,185  29%             385  23.4%            3,466  2.8%

As the table above shows, the Rails benchmarks have about 120,000 Ruby statements each, of which 45-48% are executed. Of the circa 37,500 functions, about 35-38% are executed. In total, the BlogRails benchmark makes about 13,863,000 function calls. 7.4% of these calls are polymorphic or megamorphic.

In Ruby, a call site is considered to be monomorphic if there is a single receiver class seen during execution, which usually also means there is a single method that is being called. When there is more than one receiver type, we call the call site polymorphic. Once there are more than a certain number of receiver types, a call site becomes megamorphic. In TruffleRuby, this happens when more than 8 different receiver types have been used at the call site. Though, this is a bit of a simplification, and we'll get into more details in the next section.
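To make these terms concrete, here is a small, hypothetical Ruby example; the Circle and Square classes are made up purely for illustration:

class Circle
  def initialize(r); @r = r; end
  def area; 3.14159 * @r * @r; end
end

class Square
  def initialize(s); @side = s; end
  def area; @side * @side; end
end

# Monomorphic: this call site only ever sees Circle receivers.
[Circle.new(1.0), Circle.new(2.0)].each { |c| c.area }

# Polymorphic: the same call site sees two different receiver classes.
[Circle.new(1.0), Square.new(2.0)].each { |shape| shape.area }

# Megamorphic (in TruffleRuby): a call site that has seen more than
# 8 different receiver classes.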

For now, we can observe that ERubiRails seems a bit less polymorphic: only 5.4% of its calls are polymorphic or megamorphic.

The Liquid benchmarks are much smaller, with only about 23,500 statements in about 6,200 functions. The number of calls being between 236,000 and 385,000 is also significantly smaller. Surprisingly, about 23% of all calls in the LiquidRenderBibs benchmark are polymorphic. While I haven’t looked into it in more detail, I would assume that this might be an artifact of the template having to handle a large number of differences in the input data.

Compared to other languages, these numbers do not feel too different. For instance, the DaCapo con Scala project found somewhat similar numbers for Java and Scala. In the Scala benchmarks they looked at, 89.7% of all calls were monomorphic. The Java benchmarks had about 91.5% of all calls being monomorphic.

This means what we see for Rails is roughly in line with what one would expect. This is good news, because it means that the classic optimizations are likely going to work as expected.

But before getting too enthusiastic, let’s dig a little deeper to see whether that is indeed the case.

Receiver versus Target Polymorphism

Let’s take a very simple, Rails-like example as a starting point. The following code shows the ApplicationController defining the status method, which simply returns an HTTP status code.

We also define an ArticlesController as subclass of ApplicationController. The ArticlesController implements the index method, which for brevity is kept empty.

 1  class ApplicationController
 2    def status
 3      200
 4    end
 5  end
 6
 7  class ArticlesController < ApplicationController
 8    def index
 9    end
10  end
11
12  controllers = [
13    ArticlesController.new,
14    ApplicationController.new
15  ]
16  controllers.select { |c| c.status == 200 }

At the end of the example, on line 16, we have an array with both controllers and select the ones whose status code is 200. The call to the status method is receiver-polymorphic. This means the call site sees multiple different receiver types, in our case the two controllers. Though, at the same time, the call site is target-monomorphic. This means there is only a single method that is activated.

TruffleRuby optimizes this case by using two polymorphic inline caches, or more accurately dispatch chains, one after another, as depicted in fig. 1.

Fig. 1. Optimizing Method Dispatch with two Consecutive Dispatch Chains to Eliminate Duplicate Targets.

By using two dispatch chains, a language implementation can often turn a receiver-polymorphic call site into a target-monomorphic one. The first dispatch chain acts as a classic lookup cache. It takes the receiver type (since TruffleRuby uses object shapes to optimize objects, the shape serves as a proxy for the receiver type) and caches the method that the lookup resolved to. The second chain deduplicates the target methods; in the case of TruffleRuby, it caches Truffle's call nodes, which implement the method activation, but also optimizations such as splitting.
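To illustrate the idea, here is a much simplified Ruby sketch of the two chains, reusing the controller classes from the example above. This is a conceptual simulation only; TruffleRuby implements the chains with AST nodes and object shapes rather than plain hashes, and the TwoLevelDispatch class and its "call state" placeholder are made up for illustration.

# First chain:  receiver class -> looked-up method (classic lookup cache).
# Second chain: target method  -> per-target call state (a call node in TruffleRuby).
class TwoLevelDispatch
  def initialize(selector)
    @selector     = selector
    @lookup_cache = {}  # receiver class -> method found by lookup
    @target_cache = {}  # module defining the method -> call state
  end

  def call(receiver, *args)
    klass  = receiver.class
    method = (@lookup_cache[klass] ||= klass.instance_method(@selector))
    # Many receiver classes can resolve to the same method; the second chain
    # keeps a single entry per target, so the site stays target-monomorphic
    # even when it is receiver-polymorphic.
    @target_cache[method.owner] ||= "call state for #{method.owner}##{@selector}"
    method.bind(receiver).call(*args)
  end
end

status_site = TwoLevelDispatch.new(:status)
status_site.call(ArticlesController.new)     # adds a lookup entry for ArticlesController
status_site.call(ApplicationController.new)  # adds a second lookup entry, but both
                                             # resolve to ApplicationController#status,
                                             # so the target chain has a single entry

For the example above, the first chain ends up with two entries, one per controller class, while the second chain holds just one, because both lookups resolve to the same status method.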

Based on our data, eliminating duplicate targets is also an effective optimization for Rails:

                    Number of Calls            After Eliminating Duplicate Targets
Benchmark           Polymorphic  Megamorphic   Polymorphic  Megamorphic
BlogRails               956,515       63,319        -48.8%       -99.1%
ERubiRails              626,535       40,699        -37.4%       -98.6%
LiquidCartRender         12,598          280        -73.3%      -100.0%
LiquidRenderBibs         89,866          280        -73.7%      -100.0%

The table above gives the absolute number of calls these benchmarks make. As we can see in columns two and three, there are relatively few megamorphic calls to begin with. In TruffleRuby, a call site is megamorphic when there are more than 8 different receivers or target methods. Megamorphic calls can be a performance issue, especially when class hierarchies are deep and method lookup is costly, because for such megamorphic calls we cannot cache the lookup results.

The good news is that eliminating duplicate targets is highly effective in avoiding megamorphic calls. As we can see in column five, most calls stop being megamorphic. However, the optimization is much less effective in avoiding polymorphic calls, reducing their number by only 37.4-48.8%. This means that about 50-60% of these calls are still polymorphic.

For a basic interpreter, this isn’t too bad, because we still avoid the overhead of a lookup. However, for TruffleRuby with its just-in-time compilation, this situation is not ideal, because method inlining, i.e., replacing a call by taking the method body and integrating it into the caller during compilation, is limited.
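To illustrate what inlining does, here is a hypothetical sketch reusing the controllers array from the example above; the ok? helper is made up for illustration:

def ok?(controller)
  controller.status == 200
end

# Without inlining, the block performs a call to ok?, which in turn calls status.
controllers.select { |c| ok?(c) }

# Once the just-in-time compiler inlines ok? (and ideally status as well),
# the compiled code behaves as if we had written the comparison directly,
# which then enables further optimizations:
controllers.select { |c| c.status == 200 }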

On the positive side, our Liquid render benchmarks benefit nicely here. While I haven't looked at it in detail, the identical number of megamorphic calls in both Liquid benchmarks suggests that these calls happen during the initial setup and that eliminating duplicate targets prevents them from being megamorphic.

Method Splitting

TruffleRuby uses an optimization that is not as common in just-in-time compiling systems: method splitting. Most just-in-time compilers rely solely on inlining to enable classic compiler optimizations and get efficient native code. Though, since TruffleRuby builds on the Truffle framework with its metacompilation approach, it tries harder to optimize even before the just-in-time compilation kicks in.

Truffle's method splitting copies a method in an uninitialized state. Most importantly for us, this means it copies the method without the lookup cache entries, as illustrated in fig. 2. The split method, i.e., the copy, is then associated with a single call site. The idea is that this copy specializes itself in the context of this single caller, which hopefully means method calls are more likely to be monomorphic.

Fig. 2. Method Splitting copies a method for use at a specific call site. The copy is uninitialized, which means the dispatch chains do not contain any entries yet.
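For a rough intuition of why splitting helps, consider a hypothetical helper, again reusing the controller classes from the first example. This is only a source-level analogy; the actual splitting copies uninitialized Truffle AST nodes at run time rather than Ruby source.

def respond(controller)
  controller.status  # with a single shared copy of respond, this call site sees
end                  # ArticlesController as well as ApplicationController: polymorphic

respond(ArticlesController.new)     # caller A
respond(ApplicationController.new)  # caller B

# With splitting, caller A and caller B each get their own fresh, uninitialized
# copy of respond. Each copy then only ever sees a single receiver class, so the
# controller.status call site inside each copy becomes monomorphic again.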

So, is splitting succeeding at monomorphizing call sites? Let's look at the data. Note that we already eliminated duplicate targets; thus, the numbers here are a little smaller.

                    Number of Calls (w/o duplicate targets)   After Splitting
Benchmark           Polymorphic  Megamorphic                  Polymorphic  Megamorphic
BlogRails               490,072          557                        -100%        -100%
ERubiRails              391,997          553                        -100%        -100%
LiquidCartRender          2,000            0                        -100%          n/a
LiquidRenderBibs         23,633            0                        -100%          n/a

Indeed, splitting is highly effective in turning polymorphic and megamorphic calls into monomorphic calls, which allows the just-in-time compiler to aggressively inline and optimize the Ruby code.

Overall Monomorphization

As we have seen in the last table, the polymorphic and megamorphic calls were all monomorphized. Though, let’s take a slightly different look at the data. Instead of looking at the run-time calls, let’s look at how many targets there are at a call site.

                    Maximum Number of Targets
Benchmark           before optimizations  target duplicates eliminated  after splitting
BlogRails                            206                            24                2
ERubiRails                           206                            24                2
LiquidCartRender                      20                             5                1
LiquidRenderBibs                      20                             7                1

From this table we can see that the Rails benchmarks have at least one call site with 206 different receiver types. After eliminating duplicate target methods, we see at most 24 different targets. Adding splitting to the system reduces this further to at most 2 entries. As we saw from the number of run-time calls, these optimizations in combination are indeed highly effective for Rails applications.

From this brief look, we can conclude that despite these optimizations having been invented some 30-40 years ago, they are still highly effective even for today’s dynamic Ruby on Rails systems.

Who You Gonna Call: Analyzing the Run-time Call-Site Behavior of Ruby Applications

In our paper, we go into many more details. We also look at how blocks behave (spoiler: they turn out to be slightly more polymorphic than methods). We investigate how lookup caches evolve over time and find patterns that may help us to improve performance further in the future. We also noticed that TruffleRuby’s splitting is a little too enthusiastic. For instance, blocks/Procs were always split, which has been fixed already. There is more work to be done to see whether splitting can further be reduced to avoid redundant work at run time. Though, that’s for another day.

Meanwhile, please give the paper a read, attend our presentation at DLS, and find us with questions, comments, and suggestions on Twitter @SophieKaleba and @smarr.

Abstract

Applications written in dynamic languages are becoming larger and larger and companies increasingly use multi-million line codebases in production. At the same time, dynamic languages rely heavily on dynamic optimizations, particularly those that reduce the overhead of method calls.

In this work, we study the call-site behavior of Ruby benchmarks that are being used to guide the development of upcoming Ruby implementations such as TruffleRuby and YJIT. We study the interaction of call-site lookup caches, method splitting, and elimination of duplicate call-targets.

We find that these optimizations are indeed highly effective on both smaller and large benchmarks, methods and closures alike, and help to open up opportunities for further optimizations such as inlining. However, we show that TruffleRuby’s splitting may be applied too aggressively on already-monomorphic call-sites, coming at a run-time cost. We also find three distinct patterns in the evolution of call-site behavior over time, which may help to guide novel optimizations. We believe that our results may support language implementers in optimizing runtime systems for large codebases built in dynamic languages.

  • Who You Gonna Call: Analyzing the Run-time Call-Site Behavior of Ruby Applications
    S. Kaleba, O. Larose, R. Jones, S. Marr; In Proceedings of the 18th Symposium on Dynamic Languages, DLS'22, p. 14, ACM, 2022.
  • Paper: PDF
  • DOI: 10.1145/3563834.3567538
  • BibTex: bibtex
    @inproceedings{Kaleba:2022:CallSites,
      abstract = {Applications written in dynamic languages are becoming larger and larger and companies increasingly use multi-million line codebases in production. At the same time, dynamic languages rely heavily on dynamic optimizations, particularly those that reduce the overhead of method calls.
      
      In this work, we study the call-site behavior of Ruby benchmarks that are being used to guide the development of upcoming Ruby implementations such as TruffleRuby and YJIT. We study the interaction of call-site lookup caches, method splitting, and elimination of duplicate call-targets.
      
      We find that these optimizations are indeed highly effective on both smaller and large benchmarks, methods and closures alike, and help to open up opportunities for further optimizations such as inlining. However, we show that TruffleRuby's splitting may be applied too aggressively on already-monomorphic call-sites, coming at a run-time cost. We also find three distinct patterns in the evolution of call-site behavior over time, which may help to guide novel optimizations. We believe that our results may support language implementers in optimizing runtime systems for large codebases built in dynamic languages.},
      acceptancerate = {0.4},
      author = {Kaleba, Sophie and Larose, Octave and Jones, Richard and Marr, Stefan},
      booktitle = {Proceedings of the 18th Symposium on Dynamic Languages},
      day = {7},
      doi = {10.1145/3563834.3567538},
      keywords = {Analysis CallSite DynamicLanguages Inlining LookupCache MeMyPublication Splitting myown},
      location = {Auckland, New Zealand},
      month = dec,
      note = {(acceptance rate 40%)},
      pages = {14},
      pdf = {https://stefan-marr.de/downloads/dls22-kaleba-et-al-analyzing-the-run-time-call-site-behavior-of-ruby-applications.pdf},
      publisher = {ACM},
      series = {DLS'22},
      title = {Who You Gonna Call: Analyzing the Run-time Call-Site Behavior of Ruby Applications},
      year = {2022},
      month_numeric = {12}
    }
    

Acknowledgments

Thanks to Sophie, Octave, and Chris Seaton for suggestions and corrections on this blog post.

Reducing Memory Footprint by Minimizing Hidden Class Graphs

Tomoharu noticed in his work on the eJSVM, a JavaScript virtual machine for embedded systems, that quite a bit of memory is needed for the data that helps us to represent JavaScript objects efficiently. So, we started to look into how the memory use could be reduced without sacrificing performance.

Objects in Dynamic Languages

In languages like JavaScript, Python, and Ruby, objects are much more flexible than in many other languages including Java, C++, or C#, because fields can be added and possibly even removed dynamically as needed.

I’ll use JavaScript for our examples. Let’s imagine we are working with a sensor that can determine location and movement information. Not all data may be available at every point when we access the sensor. Indeed, most likely we may just have the current longitude and latitude. Perhaps sometimes, we have access to precise GPS coordinates, which also give us altitude. In even more rare cases, the sensor might even be moving, which gives us a bearing and speed. If we imagine this in code, we might get something that looks vaguely like the following code.

 1  let location = {}
 2  location.longitude = 51.28;  // getLong()
 3  location.latitude  =  1.08;  // getLat()
 4
 5  if (hasAltitude()) {
 6    location.altitude = getAltitude();
 7  }
 8
 9  if (isMoving()) {
10    location.bearing = getBearing();
11    location.speed   = getSpeed();
12  }

Thus, when we access the sensor, we create a new JavaScript object and then add the longitude and latitude fields. Depending on the available data, we may still add the field for altitude as well as bearing and speed. However, if the data is not available, our JavaScript object will only have longitude and latitude.

In addition to adding fields arbitrarily, in JavaScript, we can even delete them. So, how do our modern language implementations implement this efficiently?

Hidden Class: Finding Structure in Dynamic Programs

The most direct way to implement objects where one can add and remove fields arbitrarily would probably be some kind of hash table or a list of field names and their values. However, neither of these two approaches is as efficient as directly accessing a known memory offset for a specific field, as is possible in languages with less flexible objects.

To gain the same efficiency in dynamic languages, hidden classes were invented. The key idea is that a language implementation can determine at run time the structure of objects, create a kind of map or hidden class that can tell us which fields an object has, possibly even the types stored in a field, and most importantly where the field can be found in memory. This works well because most code is much less dynamic than what the language would allow for.

For our example code from above, fig. 1 shows how this may look in an implementation. We start out with our location object being empty. The object only contains a pointer to an empty hidden class. Once we execute the code on line 2, the longitude field is added. We store the value 51.28 into an array that we use to store all field values. Since it is the first value, it is stored at index 0, and we record this in the hidden class. However, we really want to be able to reuse hidden classes easily. So, instead of changing the existing hidden class, we create a new one, which records that longitude is stored at index 0.

Basically the same happens when we execute line 3 and add the latitude to the object. We need to expand the array by one slot to hold the value 1.08, and we create a new hidden class that records that latitude is stored at index 1.

The next time we execute those lines, we do not need to create new hidden classes but can look them up based on the fields that we are adding.

Fig. 1. When a field is added to an object, the object needs to change the hidden class, which says where the field is to be stored. We may also need to expand the array to have enough space to store the field's value.
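The mechanism is not specific to JavaScript; TruffleRuby's object shapes, mentioned in the first part of this post, follow the same idea. As a minimal sketch, here is a hypothetical Ruby simulation of hidden classes with a transition table; the HiddenClass and Obj classes are made up for illustration and ignore details such as deleting fields:

class HiddenClass
  attr_reader :offsets

  def initialize(offsets = {})
    @offsets     = offsets  # field name -> index into the object's value array
    @transitions = {}       # field name -> hidden class reached by adding that field
  end

  # Adding a field either reuses a previously created hidden class
  # or creates (and remembers) a new one with one additional offset.
  def add_field(name)
    @transitions[name] ||= HiddenClass.new(@offsets.merge(name => @offsets.size))
  end
end

class Obj
  EMPTY_CLASS = HiddenClass.new

  def initialize
    @hidden_class = EMPTY_CLASS
    @values       = []
  end

  def set(name, value)
    unless @hidden_class.offsets.key?(name)
      @hidden_class = @hidden_class.add_field(name)
    end
    @values[@hidden_class.offsets.fetch(name)] = value
  end

  def get(name)
    @values[@hidden_class.offsets.fetch(name)]
  end
end

location = Obj.new
location.set(:longitude, 51.28)  # empty hidden class -> {longitude: 0}
location.set(:latitude,   1.08)  # -> {longitude: 0, latitude: 1}
location.get(:latitude)          # => 1.08

Because each hidden class remembers its transitions, running the same field additions a second time walks the existing transition edges instead of creating new hidden classes, which is exactly the reuse described above.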

Though, so far, we only looked at the first three lines of code. Lines 6, 10, and 11 add more fields, but do so conditionally. Focusing only on the graph of hidden classes, fig. 2 shows what that would look like.

Fig. 2. Hidden class graph for our example program. For the case that there's no altitude but movement information, the hidden class graph has a branch since the altitude field is not present.

The first three hidden classes are the same as before. In the case that we have neither an altitude nor any movement, we simply stay in the third hidden class. However, if we have any of these additional details, the so-called hidden class graph evolves further. In this particular case, we even introduce a branch depending on whether we have the altitude details. Depending on whether the altitude was added first, in the simplest kind of hidden class graph, bearing and speed would end up being stored at different indexes in the object.

Optimizing Hidden Class Graphs

In his previous work on the eJSVM, Tomoharu already relied on JavaScript programs showing fairly stable patterns in their behavior. This means the differences caused by user input and between different runs of a program are relatively minor, provided one observes a good sample of program executions. Thus, we came up with the idea of using a classic profile-guided optimization approach to optimize the hidden class graph of an application.

The basic idea is that during execution of a representative set of runs, we can use the garbage collector to gather statistics about the kind of objects used in a program, their hidden classes, and which of the hidden classes are used most. With this information, we can optimize the hidden class graph of a program for future executions.

Specifically, we apply the following optimizations:

  1. move branches in the graph to reduce memory use for the most-used hidden classes
  2. eliminate hidden classes that are only used temporarily before changing to another one
  3. merge identical branches of the graph from unrelated allocation sites

Let’s go through these optimizations step by step. For our example, let’s assume we profiled a number of representative executions and found out that indeed altitude and movement are rarely available. Thus, most objects will only have longitude and latitude fields. Furthermore, our data also tells us that it basically never happens that we have movement information without also having the altitude.

This means, we can apply our first optimization, and “move” the branch for adding altitude in the hidden class graph. Figure 3 highlights the change in red. By moving the branch for adding altitude to the hidden class after adding bearing and speed, we can merge the two branches since they are now identical. This means, the dotted hidden classes in the figure can be dropped. In the paper, we go into a bit more detail of how to make this correct without changing JavaScript semantics.

Fig. 3. By moving the rarely used branch without altitude information, the two branches become identical, and we can drop the dotted elements from the graph.

As the second optimization, we eliminate “temporary” hidden classes, which are rarely used over longer time spans. The prime examples of these temporary hidden classes are the ones that are only in use in between adding fields one after another. As highlighted with dotted lines in fig. 4, the hidden class between adding longitude and latitude, as well as the ones between adding bearing, speed, and finally altitude, can be removed. In the end, this leaves only three hidden classes in our graph.

Fig. 4. By removing intermediate classes which are only in use very briefly, the hidden class graph can be shrunk to just three hidden classes.

The third optimization is relevant for larger programs and not directly visible in our example. However, often different code paths would create the same kind of objects. Thus, the hidden classes would be basically the same, which allows us to merge these identical parts of the hidden class graph.

Results

For the evaluation, we used a variation of the Are We Fast Yet benchmarks. Here, I’ll just quickly look at the memory savings.

For these measurements, we first collected the profiling information for each benchmark, optimized the hidden class graph, and then used the result to guide the execution.

As can be seen in fig. 5, the optimized hidden class graphs are quite a bit smaller. We reduce their memory use by about 62% on average.

Fig. 5. Memory used by hidden classes and related data structures. Overall, this meta data uses about 62% less memory on average with our optimizations.

By reducing the overall number of hidden classes in the system, we also noticed a few speedups in our benchmarks. Figure 6 shows that the reduction in hidden classes lowers the number of cache misses for eJSVM's single-entry lookup caches, especially for larger benchmarks such as CD and Havlak. However, there are other effects as well. For instance, with the hidden classes known up front, we can size objects correctly on allocation, avoid the extra array to store the fields, and reduce the number of times the array needs to be expanded because of frequent transitions.

Fig. 6. Reduction in lookup/inline cache misses. eJSVM has single-entry lookup caches, which means the overall reduction in hidden classes can improve performance.

Profile Guided Offline Optimization of Hidden Class Graphs for JavaScript VMs in Embedded Systems

In the paper, we discuss a lot more details, background, and evaluation. Of course, there are various corner cases to be considered, for example things like JavaScript’s prototype objects, how to make sure that the JavaScript semantics don’t break even though the hidden classes change, and a few other bits.

Please give the paper a read, attend our presentation at VMIL’22, and find some of us for questions, comments, and suggestions on Twitter @profrejones and @smarr.

Abstract

JavaScript is increasingly used for the Internet of Things (IoT) on embedded systems. However, JavaScript’s memory footprint is a challenge, because normal JavaScript virtual machines (VMs) do not fit into the small memory of IoT devices. In part this is because a significant amount of memory is used by hidden classes, which are used to represent JavaScript’s dynamic objects efficiently.

In this research, we optimize the hidden class graph to minimize their memory use. Our solution collects the hidden class graph and related information for an application in a profiling run, and optimizes the graph offline. We reduce the number of hidden classes by avoiding introducing intermediate ones, for instance when properties are added one after another. Our optimizations allow the VM to assign the most likely final hidden class to an object at its creation. They also minimize re-allocation of storage for property values, and reduce the polymorphism of inline caches.

We implemented these optimizations in a JavaScript VM, eJSVM, and found that offline optimization can eliminate 61.9% of the hidden classes on average. It also improves execution speed by minimizing the number of hidden class transitions for an object and reducing inline cache misses.

  • Profile Guided Offline Optimization of Hidden Class Graphs for JavaScript VMs in Embedded Systems
    T. Ugawa, S. Marr, R. Jones; In Proceedings of the 14th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages, VMIL'22, p. 11, ACM, 2022.
  • Paper: PDF
  • DOI: 10.1145/3563838.3567678
  • BibTex: bibtex
    @inproceedings{Ugawa:2022:HCGOpt,
      abstract = {JavaScript is increasingly used for the Internet of Things (IoT) on embedded systems. However, JavaScript's memory footprint is a challenge, because normal JavaScript virtual machines (VMs) do not fit into the small memory of IoT devices. In part this is because a significant amount of memory is used by hidden classes, which are used to represent JavaScript's dynamic objects efficiently.
        
      In this research, we optimize the hidden class graph to minimize their memory use. Our solution collects the hidden class graph and related information for an application in a profiling run, and optimizes the graph offline. We reduce the number of hidden classes by avoiding introducing intermediate ones, for instance when properties are added one after another. Our optimizations allow the VM to assign the most likely final hidden class to an object at its creation. They also minimize re-allocation of storage for property values, and reduce the polymorphism of inline caches.
        
      We implemented these optimizations in a JavaScript VM, eJSVM, and found that offline optimization can eliminate 61.9% of the hidden classes on average. It also improves execution speed by minimizing the number of hidden class transitions for an object and reducing inline cache misses.},
      author = {Ugawa, Tomoharu and Marr, Stefan and Jones, Richard},
      booktitle = {Proceedings of the 14th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages},
      day = {5},
      doi = {10.1145/3563838.3567678},
      keywords = {EmbeddedSystems HiddenClasses InlineCaching IoT JavaScript MeMyPublication OfflineOptimization VirtualMachine myown},
      location = {Auckland, New Zealand},
      month = dec,
      pages = {11},
      pdf = {https://stefan-marr.de/downloads/vmil22-ugawa-et-al-profile-guided-offline-optimization-of-hidden-class-graphs.pdf},
      publisher = {ACM},
      series = {VMIL'22},
      title = {Profile Guided Offline Optimization of Hidden Class Graphs for JavaScript VMs in Embedded Systems},
      year = {2022},
      month_numeric = {12}
    }
    

Acknowledgments

Thanks to Tomoharu and Richard for suggestions and corrections on this blog post.

The Cost of Safety in Java

Overhead of Null Checks, Array Bounds, and Class Cast Exceptions in GraalVM Native Image

Normally, I’d say, please do not try this at home. Though, this time, it’s more appropriate:

Warning: Do not do this in production!

It’s fine to do it at home and for curiosity, but we are all better off when our programming languages are safe.

For context, with the increased interest in Rust, people are arguing again about all kinds of programming languages and their performance characteristics with respect to safety. One of the widely used ones is of course Java, which I am using for my research.

With GraalVM's native image at hand, it seemed like a good opportunity to assess roughly how much performance goes into some of Java's memory safety features. I am using native image ahead-of-time compilation here, because JVMs like HotSpot can do all kinds of nice tricks to remove more overhead at run time than other statically compiled languages can. To level the playing field, let's stick to static ahead-of-time compilation for this experiment.

Memory Safety in Java

Let’s start with a few examples of how Java ensures memory safety.

Implicit Null Checks

When accessing an object in Java, the generated code first makes sure that the object is an actual object and not null. This means, when the following code tries to access the Element object e, we have an implicit null check, which prevents us from accessing arbitrary memory.

public Object testImplicitIsNull(Element e) {
  return e.getValue();
}

To illustrate this, here is the relevant part of the compiler graph:

We can see an If node, which is not in our Java code. This If is the implicit null check. In the graph, we also see an IsNull node that accesses the method's first parameter, indicated as P(1) in the graph. The true branch on the left of the If leads to a BytecodeException node, which in later compiler phases is mapped to a classic NullPointerException.

Implicit Array Bounds Checks

When accessing arrays, Java similarly checks that we are not accessing the array outside of its boundaries. Again, trying to make sure that we do not access arbitrary memory. The following method is thus compiled to include an implicit bounds check, too:

public Object boundsCheck(Object[] array, int index) {
  return array[index];
}

This time, let’s look at the complete compiler graph for the method:

Reading it from the top, we notice again the null check. And indeed, since arrays are objects, we of course also need to check that we have a proper object first.

A little further down, we see the ArrayLength node, which reads the length of the array, and then a |<| node, which checks that the second parameter to our method, the index, is smaller than the array length but not smaller than zero. Only if that is the case do we read from the array with the LoadIndexed node. In the case of an out-of-bounds index, the compiler turns the BytecodeException node into an ArrayIndexOutOfBoundsException.

Other safety features, which I won't go into here, include for instance casts throwing a ClassCastException to prevent unsafe accesses.

Removing Safety

For this experiment, I implemented two new compiler phases for Graal. Both essentially look for the BytecodeException nodes and remove these exception branches from the compiler graphs.

In the Graal compiler, this is relatively straightforward. We can find the If node belonging to the BytecodeException node and set its condition to always be either true or false, depending on which branch the BytecodeException node was in, ensuring that it can never be activated. Graal can then remove this branch as it is not reachable anymore.

I used two phases so that the first one catches the exception branches very early in the compilation process, and the second one essentially does the same thing again after some higher-level nodes have been lowered into groups of nodes, which may again contain these safety checks.

At the moment, I don’t distinguish different types of bytecode exceptions, of which there are a few, which means, the result is indeed inherently unsafe, and even changes Java semantics, possibly breaking programs. Some Java code may, and in practice indeed does, depend on these implicit exceptions. Thus, the compiler phases are only applied to application code for which I know that it is safe.

The code can be found here: graal:unsafe-java.

Experiments

The first set of experiments is a set of benchmarks that were designed to assess cross-language performance, the Are We Fast Yet benchmarks. They contain 9 microbenchmarks, and 5 slightly larger ones to cover various common language features of object-oriented languages.

On some of these benchmarks, it doesn’t make any difference at all, since the benchmarks are dominated by other aspects. Though, even for larger benchmarks, it can reduce run time by 3-10%.

Benchmark     Run time change
CD                        -4%
DeltaBlue                -10%
Havlak                    -3%
Json                      -5%
Richards                 -10%
Bounce                    -7%
List                      -3%
Mandelbrot                 0%
NBody                      0%
Permute                   -4%
Queens                   -28%
Sieve                    -22%
Storage                    0%
Towers                    -5%

To emphasize: these benchmarks are designed to compare performance across languages, and to identify possible compiler optimizations. They are very likely not going to translate directly to the application you may care about.

For the full results, see ReBenchDB:are-we-fast-yet.

Unsafe Truffle Interpreters

For me, there are a few “applications” I care deeply about at the moment. My research focuses on speeding up interpreters, especially in the GraalVM/Truffle ecosystems.

TruffleSOM, a research language, has two different types of interpreters: a Truffle-style abstract syntax tree (AST) interpreter as well as a bytecode interpreter.

For the AST interpreter, focusing just on the larger benchmarks, we see the following results:

Benchmark     Run time change
DeltaBlue                 -5%
GraphSearch               -7%
Json                      -5%
NBody                     -5%
PageRank                  -8%
Richards                  -6%

So, the changes are in the 5-8% range. For an interpreter, that is a significant improvement. Typically any improvement here is hard won. (Full Results: TruffleSOM: AST interp.)

Let’s also look at the bytecode interpreter:

Benchmark     Run time change
DeltaBlue                 -8%
GraphSearch              -12%
Json                     -10%
NBody                    -14%
PageRank                 -16%
Richards                 -14%

Here, we see a higher benefit, in the 8-16% range. This is likely because of the accesses to the bytecode array, which no longer perform a bounds check. Since this is a very common operation, it gives a good extra bonus. (Full Results: TruffleSOM: BC interp.)

Since TruffleSOM is a research language, fairly small in comparison to other languages, let's have a look at the impact on TruffleRuby, a mature implementation of Ruby, which is a considerably more complex language.

Here I cherry-picked a couple of larger Ruby applications just to keep it brief:

Benchmark                  Run time change
AsciidoctorConvertTiny                 -5%
AsciidoctorLoadFileTiny                -5%
ActiveSupport                         -10%
LiquidCartParse                       -10%
LiquidCartRender                       -1%
LiquidMiddleware                      -17%
LiquidParseAll                         -9%
LiquidRenderBibs                       -7%
OptCarrot                              -9%

Here we see a range of 1-17% improvements, but the measurements are somewhat noisy. (Full Results: ReBenchDB: TruffleRuby)

The big question is now: are our interpreters correct enough to do away with the extra safety layer? It’s likely not a good idea, no…

For questions or comments, please find me on Twitter.
