Archive by Author

A 10 Year Journey, Stop 3: Performance, Performance, and Metaprogramming

The third post of this series is about how I started using Truffle and Graal, pretty much 4 years ago. It might be ranty in parts, but I started using it when it was in a very early stage. So, things are a lot better today.

Concurrency needs Performance, usually

As mentioned in the last post, the result of my PhD was an ownership-based metaobject protocol that is meant to enable VMs to support a wide range of different concurrency models. The major problem with the approach, and also my evaluation, was that I couldn’t show that it is practical. The RoarVM is a simple bytecode interpreter, and the literature on compiling and optimizing metaobject protocols talked only about static systems with restrictions that would make the ownership-based MOP impossible. Worse, MOPs were kind of abandoned by the research community, because performance was an issue. Many researchers moved on to aspect-oriented approaches, at least in part, because aspects are applied in a more targeted way and thus incur less general overhead than MOPs.

A hard problem, abandoned for 20 years, and nobody really interested in it anymore? Pretty much sounds like it’s a stupid idea, right? It probably was, and perhaps still is. But concurrency researchers typically want to show that their techniques are useful for performance critical applications. And, I wanted to do that, too.

How to get a fast VM?

So, I needed a new platform for my research. One option would have been to take the RoarVM all the way. Build a state-of-the-art JIT compiler for it. Another would be to apply the ideas of the RoarVM to the CogVM and improve its JIT compiler. But building another JIT compiler? That’s a huge undertaking. Would probably take a few person years to get anywhere useful. And, while I am curious about compilers, I am not really seeing myself building more than a baseline JIT compiler.

But what other options do we have? RPython is a pretty interesting project. It promises you a meta-tracing JIT compiler for your simple interpreter. Sounds great. But there’s a catch: RPython doesn’t really have a concurrency story compatible with my goals. There’s a bit of experimenting with STM going on, but no decent shared-memory GC.

And then, there was that Truffle thing that colleagues from Oracle Labs kept talking about. It was just released as open source at that point. Truffle promised that simple AST-based interpreters combined with partial evaluation would run applications as fast as Java on a JVM. Sounds great. And, the JVM has everything I need for my research, too.

TruffleSOM: The first steps with a simple Smalltalk on top of Truffle

Truffle it is, I thought, and started implementing the little Smalltalk I had been toying around with in 2007 on top of it. There was already a Java version, simply called SOM. So, how hard could it be?

Well, turns out, much harder than expected. Me being not really a compiler person, I had no intuition for how partial evaluation would work on my interpreter. And as a testament to how hard it was for me, apparently even to this date, I am third in the ranking of who spammed the graal-dev mailing list most.

I suppose there were three important reasons for it. As I mentioned before, I had no intuition of how the partial evaluation really works, and what kind of optimizations I can expect from the system. The second reason was that I did not really have access to the expertise. Mailing lists are fine but slow. And people need to be willing to take the time to answer. So, during my endeavor of building a fast Smalltalk, the most helpful conversations were actually in person at conferences with people from the Truffle team, or when I actually got the chance to visit them in Linz. And the third reason, which is fixed by now, was the overall stability of the system. To me it was very surprising that I ran into many bugs in Truffle and the Graal compiler. I could somehow not really believe that the Truffle team didn’t encounter those issues with their JavaScript and Ruby implementations. But as it turns out, each new language implementation does things somewhat differently and the languages are just different enough to trigger new edge cases that haven’t been considered before. As far as I know, there is still a little of that happening today with Graal. Every new language leads to one or two bugs being discovered that none of the previous languages hit.

How can this be sooo hard???

All in all, my experience building a new language with Truffle and Graal was far from pleasant. On the contrary, it was frustrating. I often just didn’t have the knowledge to debug problems myself. And the Truffle team didn’t really have time to teach an outsider all those basics.

So, yeah, I was very close to throwing in the towel. Well, I actually kind of did. At least a little. There was a moment of “fuck this, how can it be so hard? screw Truffle! Let’s look into RPython!”

And PySOM was born. PySOM is a literal port of the SOM bytecode interpreter to Python. SOM is really a simple and small language. If you know what you’re doing, and can type reasonably fast, you can implement it in 3 or 4 days in a new language.

PySOM was the first step. The next step was RPySOM: a port from Python to RPython, which is the Python subset that the RPython toolchain can compile statically into a fast interpreter with a meta-tracing JIT compiler. This experience was sooo much more pleasant. One big reason was that the PyPy+RPython community uses an IRC channel for communication and was super friendly and happy to help with all my problems. Another reason was that I had already known Carl Friedrich, one of the PyPy people, for a while, and he guided me through the classic pitfalls. And, I suppose, RPython at that point was already much more mature than Truffle used to be. So, fewer crashes, and I think, I didn’t really trigger many bugs in RPython either. And, since it is a trace-based compiler, understanding what the optimizer did was also much easier, because the result mapped much more directly back to the input code of the interpreter than with a fancy graph-based compiler IR. So, yeah, RPySOM was born, and with that additional knowledge, I kind of managed to make TruffleSOM a reality, too.

Some of this story, and what we learned, was written down in the Are We There Yet? article.

And Finally, Fast Metaprogramming!

Then it was time to get back to my original problem: how do I get my ownership-based metaobject protocol fast? Well, turns out if you got a fancy JIT compiler, the solution is pretty simple and already existed in other Truffle interpreters: dispatch chains. Dispatch chains are essentially lexical caches for dispatch operations. A generalization of polymorphic inline caches if you will.
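To make the idea of a dispatch chain a bit more concrete, here is a minimal Python sketch. It is not the TruffleSOM implementation: the names SendSite, CachedNode, UninitializedNode, and MAX_DEPTH are made up for illustration, and a real Truffle interpreter would express the chain as AST nodes that the compiler can inline. The sketch only shows the caching structure: each send site keeps its own chain of guarded entries and does the slow lookup only when no guard matches.

```python
class CachedNode:
    """One link of the chain: handles receivers of exactly one class."""

    def __init__(self, cls, method, next_node):
        self.cls = cls
        self.method = method
        self.next_node = next_node

    def dispatch(self, site, receiver, args):
        if type(receiver) is self.cls:          # guard: cheap class check
            return self.method(receiver, *args)
        return self.next_node.dispatch(site, receiver, args)


class UninitializedNode:
    """End of the chain: does the slow lookup and extends the chain."""

    def dispatch(self, site, receiver, args):
        method = site.lookup(type(receiver))
        site.lookups += 1
        if site.depth < site.MAX_DEPTH:         # keep the chain bounded
            site.head = CachedNode(type(receiver), method, site.head)
            site.depth += 1
        return method(receiver, *args)


class SendSite:
    """A single lexical send site, for example one `shape area` in the source."""

    MAX_DEPTH = 6  # arbitrary limit chosen for this sketch

    def __init__(self, selector):
        self.selector = selector
        self.head = UninitializedNode()
        self.depth = 0
        self.lookups = 0

    def lookup(self, cls):
        return getattr(cls, self.selector)      # stands in for a full method lookup

    def send(self, receiver, *args):
        return self.head.dispatch(self, receiver, args)


class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3 * self.r * self.r              # integer "pi" keeps the example exact


class Square:
    def __init__(self, s):
        self.s = s

    def area(self):
        return self.s * self.s


site = SendSite("area")
results = [site.send(s) for s in (Circle(1), Square(2), Circle(2), Square(3))]
```

After the four sends, the site has performed only two slow lookups, one per receiver class. The same structure works for reflective operations: the guard then checks whatever the operation depends on, for instance the selector passed to a reflective send, instead of only the receiver class.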

Together with Chris Seaton, we published a paper on Zero-Overhead Meta Programming, where we were able to show how all kinds of reflective operations can be made fast, and where I was finally able to show that my metaobject protocol can be realized without sacrificing performance.

A bit later, I also wrote up a longer paper comparing meta-tracing and partial evaluation in more detail.

Cutting a long story short: nothing is as easy as it sounds. In total, it took me two years to go from a simple Smalltalk AST interpreter to a system that can take on Java. But, things should be better today. When starting to implement a new language with Truffle, there are now a few tutorials, and other resources, and the platform is much more mature and pleasant to use!

Next week, I might take a break from this series, but there are at least two more posts coming:

  • Concurrency and Tooling, or ‘What is project MetaConc?’
  • and, Growing the SOM Family

A 10 Year Journey, Stop 2: Supporting All Kinds of Concurrency Models on a Simple VM

Last week, I started a series of posts to go over some of the projects I was involved in during my first 10 years working on language implementations. Today’s post focuses on my time as PhD student.

Let’s do something fun with… cconrnceury and pileaslarlm

After finishing my master’s thesis in 2008, I still wanted to continue this kind of work. And there was another topic hot at the time, which I wanted to look into: concurrency. In 2008, software transactional memory was all the rage. The multicore revolution was going strong, and we all expected to use 32 core processors in 2015. I guess, the 32 cores didn’t quite work out. Nonetheless, concurrency and parallelism is a topic that’s relevant for a much larger group of people than it used to be.

As I said, the topic was kind of hot, and the people at the Software Languages Lab were interested in it as well, and did cool things with concurrency and language implementations. Most widely known is perhaps AmbientTalk, an actor language for peer-to-peer applications on top of ad hoc mobile networks.

I got lucky, and my project proposal to decouple abstract from concrete concurrency models got accepted by IWT and I got funding for four years of PhD research. I have to say, it was a big vision. In the end, my PhD scratched perhaps at the surface of a quarter of the things that would be necessary to realize the vision put forth in the proposal.

Either way, I had the chance to work on quite a few interesting ideas. Early on, I got involved with David Ungar and Sam Adams’ work on the Renaissance project. David worked on a Smalltalk VM for a manycore processor with 64 cores. In the beginning, I didn’t have access to those 64 core Tilera processors. Instead, I started porting what became the RoarVM to standard multicore systems. The RoarVM is essentially a reimplementation of the Squeak Smalltalk interpreter in C++. The goal was to support classic shared-memory concurrency, and instead of fearing race conditions, the goal was to handle them retroactively: race and repair. I haven’t really worked much on the race-and-repair idea myself, but the work on a fully concurrent and parallel VM was very exciting.

As mentioned above, the lab was interested in actor languages. So, I guess, it isn’t really surprising that I started dabbling with them as well. One of the results was ActorSOM++. It was a simple Actor language based on SOM++, a SOM implemented in C++.

I also got involved in research on making the Actor model more useful for commodity multicore systems. Together with Joeri De Koster, we worked on a few papers on a domain model. We wanted to preserve the basic guarantees of the actor model, while still providing the data parallelism of shared memory concurrency.

And then there was that ‘Rete thing’. Lode Hoste used a Rete-based system to enable declarative multi-touch applications. As one might imagine, that’s the kind of stuff that’s great for giving impressive demos. So, the two of us decided to spend a week on parallelizing the CLIPS Rule Engine. Of course, a week wasn’t enough, but it gave us enough of an idea of what we were up against to start our own parallel Rete implementation. Well, actually, Thierry Renaux did most of the work. The result was PARTE, an actor-based parallel Rete engine. And of course, in 2013, there also had to be a version for the cloud.

These and various other experiments led me to propose a metaprogramming-based solution for the problem of supporting all kinds of different concurrency models on the same VM. In the end, this approach, the ownership-based meta-object protocol (OMOP), also became the focus of my PhD dissertation. The OMOP allowed me to customize the basic behavior for field accesses and method dispatches, for instance to enforce isolation between actors, or to implement a basic STM. My implementation was based on the RoarVM, which means, everything was pretty slow. So, performance remained one of the big open questions. The other open question was whether we can actually find ways to use all these different concurrency models safely together.
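As a rough illustration of what customizing field accesses through ownership can look like, here is a small Python sketch. It is not the OMOP's actual interface: Domain, ReadOnlyDomain, Owned, and write_field are hypothetical names, and only field writes are intercepted. The point is merely that every object has an owner, and the owner decides what a basic operation on that object means.

```python
class Domain:
    """Hypothetical metaobject: defines what a field write on an owned object means."""

    def write_field(self, obj, name, value):
        # Default semantics: simply perform the write.
        object.__setattr__(obj, name, value)


class ReadOnlyDomain(Domain):
    """A domain that rejects all writes, e.g. for objects shared between actors."""

    def write_field(self, obj, name, value):
        raise PermissionError(
            f"cannot write {name!r}: object is owned by a read-only domain")


class Owned:
    """Base class for objects whose field writes are delegated to their owner."""

    def __init__(self, owner):
        # Bypass our own __setattr__ so that setting the owner is not intercepted.
        object.__setattr__(self, "owner", owner)

    def __setattr__(self, name, value):
        self.owner.write_field(self, name, value)


counter = Owned(Domain())
counter.hits = 1                      # allowed: the default domain performs the write

shared = Owned(ReadOnlyDomain())
try:
    shared.hits = 1                   # rejected by the owning domain
    write_rejected = False
except PermissionError:
    write_rejected = True
```

Because every single field write takes this detour through the owner, a plain interpreter pays for the indirection on each access, which is why performance was the open question.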

But, those questions didn’t really fit into the PhD anymore. And, they might also better fit into one of the next posts on:

  • performance, performance, and metaprogramming
  • and safe combination of concurrency models

10 Years of Language Implementations

First Stop: VMs, Compilers, and Modularity

In April 2007, I embarked on a long journey. A journey on which I already met a lot of interesting people, learned many fascinating things, and had a lot of fun implementing programming languages. It all started on this day in 2007. At least if I can trust the date of my first commit on CSOM to my SVN server back then. A lot has happened in the last 10 years, and, perhaps mostly for myself, I wanted to recount some of the projects I was involved in.

It all started with me wanting to know more about the low-level things that I kind of avoided during my bachelor studies. I had been programming for a long time, but never really knew how it all actually worked. So, I enrolled in the excellent Virtual Machines (VMs) course, which was taught by Michael Haupt at the time. I also took a course on Software Design, in which I studied Traits.

Why do I mention traits? Well, I had been using PHP since 2000 or so. It was my language of choice. And to understand traits better, I decided the best way would be to implement them for PHP. So, more work on language implementations. I have to admit, the main reason I didn’t just study them in Squeak Smalltalk was because Squeak looked silly and I didn’t like it. I guess, I was just stubborn. And that stubbornness caused me to inflict traits on PHP as part of my first venture into programming language design.

As a result, my traits for PHP were released about 5 years later with PHP 5.4. So, it took a lot of stubbornness… Fun fact: Wikipedia explains traits with a PHP example, perhaps because PHP is one of the few curly-brace languages that is relatively close to the Smalltalk traits design.

Meanwhile, in the VM course, we started to look in detail into a little Smalltalk called SOM (Simple Object Machine). Specifically, we worked with CSOM, the C implementation of SOM. Together with a fellow student, I chose a rather ambitious topic: build a just-in-time (JIT) compiler for SOM. Well, in the end, I think, he did most of the work. And I learned more than I could have imagined. In our final presentation we reported performance gains of 20% to 55%. The JIT compiler itself was a baseline compiler that translated bytecodes one by one to x86 machine code. The only fancy thing it did was to support hybrid stack frames, i.e., using essentially the C stack, but still providing a full object representation of the stack as Smalltalk context objects.

This JIT compiler project was a lot of fun, but also a lot of headache… Perhaps not something I’d generally recommend as a first project. However, after the VM course, and the work on traits, I was really interested to continue and learn more about VMs and modularity, and perhaps also combine it with the hyped aspect-oriented, feature-oriented, and context-oriented programming ideas, which I hadn’t taken the time to study yet.

Under the guidance of Michael and Robert Hirschfeld, I started the work on my master’s thesis, which resulted in a Virtual Machine Architecture Definition Language (VMADL). VMADL combined ideas of feature-oriented and aspect-oriented programming to allow us to build a VM product line: CSOM/PL. It used CSOM, from the VM course, and combined the results of the various student projects. So, one could build a CSOM for instance with native or green threads, with a reference counting GC, or a traditional mark/sweep GC, and so on. It was all based on a common code base of service modules, which were linked together with combiners that used aspects to weave in necessary functionality at points explicitly exposed by the service modules. Since that is all very brief and abstract, the CSOM/PL paper is probably a better place to read up on it.

I guess, that’s enough for today. Since this only covers the first few steps until summer 2008, there is more to come on:

  • supporting all kinds of concurrency models on a simple VM
  • performance, performance, and metaprogramming
  • and safe combination of concurrency models

SOMns 0.2 Release with CSP, STM, Threads, and Fork/Join

Since SOMns is a pure research project, we aren’t usually doing releases for SOMns yet. However, we added many different concurrency abstractions since December and have plans for bigger changes. So, it seems like a good time to wrap up another step, and get it into a somewhat stable shape.

The result is SOMns v0.2, a release that adds support for communicating sequential processes, shared-memory multithreading, fork/join, and a toy STM. We also improved a variety of things under the hood.

Note, SOMns is still not meant for ‘users’. It is however a stable platform for concurrency research and student projects. If you’re interested in working with it, drop us a line, or check out the getting started guide.

0.2.0 – 2017-03-07 Extended Concurrency Support

Concurrency Support

  • Added basic support for shared-memory multithreading and fork/join
    programming (PR #52)

    • object model now uses a global safepoint to synchronize layout changes
    • array strategies are not safe yet
  • Added Lee and Vacation benchmarks (PR #78)

  • Configuration flag for actor tracing, -atcfg=
    example: -atcfg=mt:mp:pc turns off message timestamps, message parameters and promises

  • Added Validation benchmarks and a new Harness.

  • Added basic Communicating Sequential Processes support.
    See PR #84.

  • Added CSP version of PingPong benchmark.

  • Added simple STM implementation. See s.i.t.Transactions and PR #81 for details.

  • Added breakpoints for channel operations in PR #99.

  • Fixed isolation issue for actors. The test that an actor is only created
    from a value was broken (issue #101, PR #102)

  • Optimize processing of common single messages by avoiding allocation and
    use of object buffer (issue #90)

Interpreter Improvements

  • Turn writes to method arguments into errors. Before it was leading to
    confusing setter sends and ‘message not understood’ errors.

  • Simplified AST inlining and use objects to represent variable info to improve
    details displayed in debugger (PR #80).

  • Make instrumentation more robust by defining number of arguments of an
    operation explicitly.

  • Add parse-time specialization of primitives. This enables very early
    knowledge about the program, which might be unreliable, but should be good
    enough for tooling. (See Issue #75 and PR #88)

  • Added option to show methods after parsing in IGV with
    -im/--igv-parsed-methods (issue #110)

Communicating Sequential Processes for Newspeak/SOMns

One possible way for modeling concurrent systems is Tony Hoare’s classic approach of having isolated processes communicate via channels, which is called Communicating Sequential Processes (CSP). Today, we see the approach used for instance in Go and Clojure.

While Newspeak’s specification and implementation come with support for Actors, I want to experiment also with other abstractions, and CSP happens to be an interesting one, since it models systems with blocking synchronization, also known as channels with rendezvous semantics. I am not saying CSP is better in any specific case than actors. Instead, I want to find out where CSP’s abstractions provide a tangible benefit.

But, the reason for this post is another one. One of my biggest quibbles with most CSP implementations is that they don’t take isolation seriously. Usually, they provide merely lightweight concurrency and channels, but they rarely ensure that different processes don’t share any mutable memory. So, the door for low-level race conditions is wide open. The standard argument of language or library implementers is that guaranteeing isolation is not worth the performance overhead that comes with it. For me, concurrency is hard enough, so, I prefer to have the guarantee of proper isolation. Of course, another part of the argument is that you might need shared memory for some problems, but, I think we got a more disciplined approach for those problems, too.

Isolated Processes in Newspeak

Ok, so how can we realize isolated processes in Newspeak? As it turns out, it is pretty simple. Newspeak already got the notion of values. Values are deeply immutable objects. This means values can only contain values themselves, which as a consequence means, if you receive some value from a concurrent entity, you are guaranteed that the state never changes.

In SOMns, you can use the Value mixin to mark a class as having value semantics. This means that none of the fields of the object are allowed to be mutable, and that we need to check that fields are only initialized with values in the object’s constructor. Since Newspeak uses nested classes pretty much everywhere, we also need to check that the outer scope of a value class does not have any mutable state. Once that is verified, an object can be a proper deeply immutable value, and can be shared without introducing any data races between concurrent entities.
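The two object-level checks described here, no mutable fields and fields initialized only with values, can be sketched in Python as follows. This is an analogy and not how SOMns implements it: the Value class, is_value, and ATOM_TYPES are hypothetical names, and the check of the outer lexical scope has no counterpart in this sketch.

```python
ATOM_TYPES = (int, float, str, bytes, bool, type(None))


def is_value(obj):
    """True if obj is deeply immutable according to this sketch's rules."""
    if type(obj) in ATOM_TYPES:                 # exact types only, no subclasses
        return True
    if type(obj) in (tuple, frozenset):         # immutable containers of values
        return all(is_value(element) for element in obj)
    return isinstance(obj, Value)               # checked at construction time


class Value:
    """Hypothetical mixin: fields are checked once and never change afterwards."""

    def __init__(self, **fields):
        for name, field_value in fields.items():
            if not is_value(field_value):
                raise TypeError(f"field {name!r} is not initialized with a value")
            object.__setattr__(self, name, field_value)

    def __setattr__(self, name, field_value):
        raise AttributeError("values are immutable")


class Point(Value):
    pass


p = Point(x=1, y=(2, 3))                        # fine: only values inside

try:
    Point(x=[1, 2])                             # a list is mutable, so this fails
    constructed_with_list = True
except TypeError:
    constructed_with_list = False
```

Doing the check once in the constructor means that later tests, for instance when an object is sent over a channel, only need to ask whether the object is an instance of a value class.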

Using this as a foundation, we can require that all classes that represent CSP processes are values. This gives us the guarantee that a process does not have access to any shared mutable state by itself. Note, this is only about the class side. The object side can actually be a normal object and have mutable state, which means, within a process, we can have normal mutable state/objects.

Using the value notion of Newspeak feels like a very natural solution to me. Alternative approaches could use a magic operator that cuts off lexical scope. This is something that I have seen for instance in AmbientTalk with its isolates. While this magic isolate keyword gives some extra flexibility, it is also a new concept. Having to ensure that a process’ class is a value requires that its outer lexical scope is a value, and thus, restricts a bit how we structure our modules, but, it doesn’t require any new concepts. One other drawback here is that it is often not clear that the lexical scope is a value, but I think that’s something where an IDE should help and provide the necessary insights.

In code, this looks then a bit like this:

class ExampleModule = Value ()(
  class DoneProcess new: channelOut = Process (
  | private channelOut = channelOut. |
  )(
    public run = ( channelOut write: #done )
  )
  
  public start = (
    processes spawn: DoneProcess
               with: {Channel new out}
  )
)

So, we got a class DoneProcess, which has a run method that defines what the process does. Our processes module allows us to spawn the process with arguments, which is in this case the output end of a channel.

Channels

The other aspect we need to think about is how we can design channels so that they preserve isolation. As a first step, I’ll only allow values to be sent on the channel. This ensures isolation, and checking whether the provided object is a value is simple and efficient.
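A channel with rendezvous semantics that only accepts values can be sketched in Python with threads. This is not the SOMns implementation: the Channel class here is a stand-in, and is_value is a deliberately small placeholder that accepts only a few immutable built-in types.

```python
import queue
import threading

ATOM_TYPES = (int, float, str, bytes, bool, type(None))


def is_value(obj):
    """Placeholder check: immutable atoms and tuples of them."""
    if type(obj) in ATOM_TYPES:
        return True
    if type(obj) is tuple:
        return all(is_value(element) for element in obj)
    return False


class Channel:
    """Unbuffered channel: write blocks until a reader has taken the value."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._write_lock = threading.Lock()     # one writer in the slot at a time

    def write(self, value):
        if not is_value(value):                 # the isolation check
            raise TypeError("only values can be sent on a channel")
        with self._write_lock:
            self._slot.put(value)
            self._slot.join()                   # wait for the reader's task_done

    def read(self):
        value = self._slot.get()                # blocks until a writer arrives
        self._slot.task_done()                  # releases the waiting writer
        return value


channel = Channel()
producer = threading.Thread(target=channel.write, args=("done",))
producer.start()
received = channel.read()
producer.join()

try:
    channel.write(["mutable", "list"])          # rejected before any blocking
    list_accepted = True
except TypeError:
    list_accepted = False
```

The check happens before the writer blocks, so a rejected object never reaches another process.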

However, this approach is also very restrictive. Because of the deeply immutable semantics of values, they are quite inflexible in my experience.

When thinking of what it means to be a value, imagine a bunch of random objects: they all can point to values, but values can never point back to any mutable object. That’s a very nice property from the concurrency perspective, but in practice this means that I often feel the need to represent data twice. Once as mutable, for instance for constructing complex data structures, and a second time as values so that I can send data to another process.

A possible solution might be objects with copy-on-transfer semantics, or actual ownership transfer. This could be modeled either with a new type of transfer objects, or a copying channel. Perhaps there are other options out there. But for the moment, I am already happy with seeing that we can have proper CSP semantics by merely checking that a process is constructed from values only and that channels only pass on values.
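The copying-channel option could look roughly like the following Python sketch. CopyingChannel is a hypothetical name, the channel is buffered rather than rendezvous-based to keep the example short, and copy.deepcopy stands in for whatever copy mechanism a real implementation would use. It shows copy-on-transfer only, not ownership transfer.

```python
import copy
import queue


class CopyingChannel:
    """Sketch of copy-on-transfer: the receiver gets its own deep copy."""

    def __init__(self):
        self._items = queue.Queue()             # buffered, for brevity

    def write(self, obj):
        # Copy at send time, so later changes by the sender stay invisible.
        self._items.put(copy.deepcopy(obj))

    def read(self):
        return self._items.get()


channel = CopyingChannel()
data = [1, 2, 3]
channel.write(data)
data.append(4)                                  # sender keeps mutating its object
received = channel.read()                       # receiver still sees the old state
```

The price is the copy itself, which is the kind of overhead that the value-only channel avoids.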

Since the implementation is mostly a sketch, there are of course more things that need to be done. For instance, it doesn’t yet support any nondeterminism, which requires an alt or select operation on channels.