We know that Ruby and especially Rails applications can be very dynamic and pretty large. However, many of the optimizations that interpreters and even just-in-time compilers use were invented in the 1980s and 1990s, before Ruby and Rails even existed. So, I was wondering: do these optimizations still have a chance of coping with the millions of lines of Ruby code that large Rails apps from Shopify, Stripe, or GitLab have? Unfortunately, we don’t have access to such applications. As the next best thing, we took the largest Ruby benchmarks we could get our hands on and analyzed those.

As part of her research, Sophie wrote a paper investigating the behavior of method call sites in detail. She looked at how well optimizations such as lookup caches, target duplicate elimination, and splitting apply to modern Ruby code. I’ll use her work here as a foundation and zoom in on the Rails apps we looked at. For all details, including the measurement methodology, I’ll defer to sec. 3 of the paper, which also discusses how Sophie instrumented TruffleRuby and how the data was processed.

BlogRails, ERubiRails, and the Liquid Benchmarks

The benchmarks I am going to be focusing on are called BlogRails, ERubiRails, LiquidCartRender, and LiquidRenderBibs. BlogRails, usually referred to as railsbench, is a small Ruby on Rails application simulating a basic blog, as created by Rails’ scaffold generator. The benchmark accesses existing blog posts and creates new ones. ERubiRails is a similarly small Rails app; it renders an ERB template from the Discourse project.

I also included two Liquid template language benchmarks here out of curiosity. LiquidCartRender uses Liquid to render an HTML page for a shopping cart. LiquidRenderBibs renders an HTML page with a list of papers that have a variety of different data bits to be shown (specifically this one here).

| Benchmark        | Statements | Statement Coverage | Functions | Function Coverage | Calls (in 1000) | Poly. and Megamorphic Calls | Used Call Sites | Poly. and Megamorphic Call Sites |
|------------------|-----------:|-------------------:|----------:|------------------:|----------------:|----------------------------:|----------------:|---------------------------------:|
| BlogRails        | 118,717    | 48%                | 37,595    | 38%               | 13,863          | 7.4%                        | 52,361          | 2.3%                             |
| ERubiRails       | 117,922    | 45%                | 37,328    | 35%               | 12,309          | 5.4%                        | 47,794          | 2.3%                             |
| LiquidCartRender | 23,562     | 39%                | 6,269     | 30%               | 236             | 5.5%                        | 3,581           | 2.4%                             |
| LiquidRenderBibs | 23,277     | 39%                | 6,185     | 29%               | 385             | 23.4%                       | 3,466           | 2.8%                             |

As the table above shows, the Rails benchmarks have about 120,000 Ruby statements each, of which 45-48% are executed. Of the circa 37,500 functions, about 35-38% are executed. In total, the BlogRails benchmark makes about 13,863,000 function calls. 7.4% of these calls are polymorphic or megamorphic.

In Ruby, a call site is considered monomorphic if a single receiver class is seen during execution, which usually also means there’s a single method that is being called. When there is more than one receiver type, we call the call site polymorphic. Once the number of receiver types exceeds a certain threshold, the call site becomes megamorphic. In TruffleRuby, this happens when more than 8 different receiver types are used at the call site. Though, this is a bit of a simplification, and we’ll get into more details in the next section.
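To make these terms concrete, here is a small illustrative snippet (the classes are arbitrary; the threshold of 8 is TruffleRuby’s):

```ruby
# A single call site (`x.to_s` below) shared by all callers of this method.
def stringify(x)
  x.to_s
end

stringify(1)
stringify(2)    # only Integer receivers so far: the call site is monomorphic

stringify("a")  # Integer and String receivers: now polymorphic

# If more than 8 distinct receiver classes reached this call site,
# TruffleRuby would consider it megamorphic.
```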

Until then, we can observe that ERubiRails seems a bit less polymorphic. Only 5.4% of its calls are polymorphic or megamorphic.

The Liquid benchmarks are much smaller, with only about 23,500 statements in about 6,200 functions. With between 236,000 and 385,000 calls, they also make significantly fewer calls. Surprisingly, about 23% of all calls in the LiquidRenderBibs benchmark are polymorphic. While I haven’t looked into it in more detail, I would assume that this is an artifact of the template having to handle a large number of differences in the input data.

Compared to other languages, these numbers do not feel too different. For instance, the Da Capo con Scala project found somewhat similar numbers for Java and Scala: in the Scala benchmarks they looked at, 89.7% of all calls were monomorphic, and the Java benchmarks had about 91.5% of all calls being monomorphic.

This means what we see for Rails is roughly in line with what one would expect. This is good news, because it means that the classic optimizations are likely going to work as expected.

But before getting too enthusiastic, let’s dig a little deeper to see whether that is indeed the case.

Receiver versus Target Polymorphism

Let’s take a very simple, Rails-like example as a starting point. The following code shows the ApplicationController defining the status method, which simply returns an HTTP status code.

We also define an ArticlesController as subclass of ApplicationController. The ArticlesController implements the index method, which for brevity is kept empty.

class ApplicationController
  def status
    200
  end
end

class ArticlesController < ApplicationController
  def index
  end
end

controllers = [
  ArticlesController.new,
  ApplicationController.new
]
controllers.select { |c| c.status == 200 }

On the last line of the example, we have an array with both controllers and select the ones with a status code of 200. The call to the status method is receiver-polymorphic: the call site sees multiple different receiver types, in our case the two controllers. At the same time, though, the call site is target-monomorphic: there is only a single method that is activated.

TruffleRuby optimizes this case by using two polymorphic inline caches, or more accurately, dispatch chains, one after another, as depicted in fig. 1.

Fig. 1. Optimizing Method Dispatch
with two Consecutive Dispatch Chains to Eliminate Duplicate Targets.

By using two dispatch chains, a language implementation can often turn a receiver-polymorphic call site into a target-monomorphic one. The first dispatch chain acts as a classic lookup cache. It takes the receiver type (since TruffleRuby uses object shapes to optimize objects, shapes serve as a proxy for the receiver type) and caches the method as the result of a lookup. The second chain deduplicates the target methods; in the case of TruffleRuby, it caches Truffle’s call nodes, which implement the method activation, but also optimizations such as splitting.
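As a rough illustration, here is a toy model in plain Ruby of such a call site with two chained caches. This is only a sketch of the idea, not TruffleRuby’s actual implementation (which operates on object shapes and Truffle call nodes); the `CallSite` class and its bookkeeping are made up for this example:

```ruby
# The controller classes from the example above.
class ApplicationController
  def status
    200
  end
end

class ArticlesController < ApplicationController
  def index
  end
end

# A toy call site with two dispatch chains: the first caches the lookup
# result per receiver class, the second deduplicates the target methods.
class CallSite
  def initialize(name)
    @name = name
    @lookup_chain = {}  # receiver class -> looked-up UnboundMethod
    @target_chain = {}  # defining class -> deduplicated target
  end

  def call(receiver)
    method = (@lookup_chain[receiver.class] ||= receiver.class.instance_method(@name))
    target = (@target_chain[method.owner] ||= method)
    target.bind(receiver).call
  end

  def receiver_entries; @lookup_chain.size; end
  def target_entries;   @target_chain.size; end
end

site = CallSite.new(:status)
[ArticlesController.new, ApplicationController.new].each { |c| site.call(c) }
site.receiver_entries  # 2 receiver classes: receiver-polymorphic
site.target_entries    # but only 1 target method: target-monomorphic
```

Both lookups resolve to the method defined in ApplicationController, so the second chain collapses them into a single entry.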

Based on our data, eliminating duplicate targets is also an effective optimization for Rails:

| Benchmark        | Polymorphic Calls | Megamorphic Calls | Polymorphic, after Eliminating Duplicate Targets | Megamorphic, after Eliminating Duplicate Targets |
|------------------|------------------:|------------------:|-------------------------------------------------:|-------------------------------------------------:|
| BlogRails        | 956,515           | 63,319            | -48.8%                                           | -99.1%                                           |
| ERubiRails       | 626,535           | 40,699            | -37.4%                                           | -98.6%                                           |
| LiquidCartRender | 12,598            | 280               | -73.3%                                           | -100.0%                                          |
| LiquidRenderBibs | 89,866            | 280               | -73.7%                                           | -100.0%                                          |

The table above gives the absolute number of calls these benchmarks make. As we can see in columns two and three, there are relatively few megamorphic calls to begin with. In TruffleRuby, a call site is megamorphic when there are more than 8 different receivers or target methods. Megamorphic calls can be a performance issue, especially when class hierarchies are deep and method lookup is costly, because for such megamorphic calls we cannot cache the lookup results.

The good news is that eliminating duplicate targets is highly effective in avoiding megamorphic calls. As we can see in column five, most calls stop being megamorphic. However, the optimization is much less effective in avoiding polymorphic calls, reducing their number by only 37.4-48.8%. This means that about 50-60% of these calls are still polymorphic.

For a basic interpreter, this isn’t too bad, because we still avoid the overhead of a lookup. However, for TruffleRuby with its just-in-time compilation, this situation is not ideal, because method inlining, i.e., replacing a call by taking the method body and integrating it into the caller during compilation, is limited.

On the positive side, our Liquid render benchmarks benefit nicely here. While I haven’t looked in detail, the number of megamorphic calls being the same suggests that these calls are made in the initial setup and eliminating duplicate targets prevents them from being megamorphic.

Method Splitting

TruffleRuby uses an optimization that is not as common in just-in-time-compiling systems: method splitting. Most just-in-time compilers rely solely on inlining to enable classic compiler optimizations and produce efficient native code. Though, since TruffleRuby builds on the Truffle framework with its metacompilation approach, it tries harder to optimize even before just-in-time compilation kicks in.

Truffle’s method splitting copies a method in an uninitialized state. Most importantly for us, this means it copies the method without the lookup cache entries, as illustrated in fig. 2. The split method, i.e., the copy, is then associated with a single call site. The idea is that this copy specializes itself in the context of this single caller, which hopefully means method calls are more likely to be monomorphic.

Fig. 2. Method Splitting copies a method for use at a specific call site. The copy is uninitialized, which means the dispatch chains do not contain any entries yet.
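To sketch the effect, here is another toy model (again, not Truffle’s actual machinery; `ModelMethod` and its names are invented for illustration): a method whose body contains one call site with a lookup cache, where splitting hands each caller a fresh copy whose cache starts out empty.

```ruby
# Toy model of a splittable method: its body contains one call site,
# represented here just by the lookup cache that call site accumulates.
class ModelMethod
  attr_reader :inner_cache

  def initialize
    @inner_cache = {}  # receiver class -> cached lookup result
  end

  # Splitting: a fresh, uninitialized copy for one specific call site.
  def split
    ModelMethod.new
  end

  def invoke(arg)
    @inner_cache[arg.class] ||= arg.class.instance_method(:to_s)
    arg.to_s
  end
end

shared = ModelMethod.new
shared.invoke(1)
shared.invoke("a")
shared.inner_cache.size   # 2: two callers made the shared copy polymorphic

split_a = shared.split
split_b = shared.split
split_a.invoke(1)
split_b.invoke("a")
split_a.inner_cache.size  # 1: each split copy only sees its own caller
split_b.inner_cache.size  # 1: and thus stays monomorphic
```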

So, is splitting succeeding at monomorphizing call sites? Let’s look at the data. Note that we already eliminated duplicate targets, so the numbers here are a little smaller.

| Benchmark        | Polymorphic Calls (w/o Duplicate Targets) | Megamorphic Calls (w/o Duplicate Targets) | Polymorphic, after Splitting | Megamorphic, after Splitting |
|------------------|------------------------------------------:|------------------------------------------:|-----------------------------:|-----------------------------:|
| BlogRails        | 490,072                                   | 557                                       | -100%                        | -100%                        |
| ERubiRails       | 391,997                                   | 553                                       | -100%                        | -100%                        |
| LiquidCartRender | 2,000                                     | 0                                         | -100%                        | n/a                          |
| LiquidRenderBibs | 23,633                                    | 0                                         | -100%                        | n/a                          |

Indeed, splitting is highly effective in turning polymorphic and megamorphic calls into monomorphic calls, which allows the just-in-time compiler to aggressively inline and optimize the Ruby code.

Overall Monomorphization

As we have seen in the last table, the polymorphic and megamorphic calls were all monomorphized. Though, let’s take a slightly different look at the data. Instead of looking at the run-time calls, let’s look at how many targets there are at a call site.

Maximum Number of Targets

| Benchmark        | Before Optimizations | After Eliminating Duplicate Targets | After Splitting |
|------------------|---------------------:|------------------------------------:|----------------:|
| BlogRails        | 206                  | 24                                  | 2               |
| ERubiRails       | 206                  | 24                                  | 2               |
| LiquidCartRender | 20                   | 5                                   | 1               |
| LiquidRenderBibs | 20                   | 7                                   | 1               |

From this table we can see that the Rails benchmarks have at least one call site with 206 different receiver types. After eliminating duplicate target methods, we see at most 24 different targets. Adding splitting to the system reduces this further to at most 2 entries. As we saw from the number of run-time calls, these optimizations in combination are indeed highly effective for Rails applications.

From this brief look, we can conclude that despite these optimizations having been invented some 30-40 years ago, they are still highly effective even for today’s dynamic Ruby on Rails systems.

Who You Gonna Call: Analyzing the Run-time Call-Site Behavior of Ruby Applications

In our paper, we go into many more details. We also look at how blocks behave (spoiler: they turn out to be slightly more polymorphic than methods). We investigate how lookup caches evolve over time and find patterns that may help us to improve performance further in the future. We also noticed that TruffleRuby’s splitting is a little too enthusiastic. For instance, blocks/Procs were always split, which has been fixed already. There is more work to be done to see whether splitting can further be reduced to avoid redundant work at run time. Though, that’s for another day.

Meanwhile, please give the paper a read, attend our presentation at DLS, and find us with questions, comments, and suggestions on Twitter @SophieKaleba and @smarr.

Abstract

Applications written in dynamic languages are becoming larger and larger and companies increasingly use multi-million line codebases in production. At the same time, dynamic languages rely heavily on dynamic optimizations, particularly those that reduce the overhead of method calls.

In this work, we study the call-site behavior of Ruby benchmarks that are being used to guide the development of upcoming Ruby implementations such as TruffleRuby and YJIT. We study the interaction of call-site lookup caches, method splitting, and elimination of duplicate call-targets.

We find that these optimizations are indeed highly effective on both smaller and large benchmarks, methods and closures alike, and help to open up opportunities for further optimizations such as inlining. However, we show that TruffleRuby’s splitting may be applied too aggressively on already-monomorphic call-sites, coming at a run-time cost. We also find three distinct patterns in the evolution of call-site behavior over time, which may help to guide novel optimizations. We believe that our results may support language implementers in optimizing runtime systems for large codebases built in dynamic languages.

  • Who You Gonna Call: Analyzing the Run-time Call-Site Behavior of Ruby Applications
    S. Kaleba, O. Larose, R. Jones, S. Marr; In Proceedings of the 18th Symposium on Dynamic Languages, DLS'22, p. 14, ACM, 2022.
  • Paper: PDF
  • DOI: 10.1145/3563834.3567538
  • BibTex: bibtex
    @inproceedings{Kaleba:2022:CallSites,
      abstract = {Applications written in dynamic languages are becoming larger and larger and companies increasingly use multi-million line codebases in production. At the same time, dynamic languages rely heavily on dynamic optimizations, particularly those that reduce the overhead of method calls.
      
      In this work, we study the call-site behavior of Ruby benchmarks that are being used to guide the development of upcoming Ruby implementations such as TruffleRuby and YJIT. We study the interaction of call-site lookup caches, method splitting, and elimination of duplicate call-targets.
      
      We find that these optimizations are indeed highly effective on both smaller and large benchmarks, methods and closures alike, and help to open up opportunities for further optimizations such as inlining. However, we show that TruffleRuby's splitting may be applied too aggressively on already-monomorphic call-sites, coming at a run-time cost. We also find three distinct patterns in the evolution of call-site behavior over time, which may help to guide novel optimizations. We believe that our results may support language implementers in optimizing runtime systems for large codebases built in dynamic languages.},
      acceptancerate = {0.4},
      author = {Kaleba, Sophie and Larose, Octave and Jones, Richard and Marr, Stefan},
      blog = {https://stefan-marr.de/2022/11/how-effective-are-classic-lookup-optimizations-for-rails-apps/},
      booktitle = {Proceedings of the 18th Symposium on Dynamic Languages},
      day = {7},
      doi = {10.1145/3563834.3567538},
      keywords = {Analysis CallSite DynamicLanguages Inlining LookupCache MeMyPublication Splitting myown},
      location = {Auckland, New Zealand},
      month = dec,
      note = {(acceptance rate 40%)},
      pages = {14},
      pdf = {https://stefan-marr.de/downloads/dls22-kaleba-et-al-analyzing-the-run-time-call-site-behavior-of-ruby-applications.pdf},
      publisher = {ACM},
      series = {DLS'22},
      title = {Who You Gonna Call: Analyzing the Run-time Call-Site Behavior of Ruby Applications},
      year = {2022},
      month_numeric = {12}
    }
    

Acknowledgments

Thanks to Sophie, Octave, and Chris Seaton for suggestions and corrections on this blog post.