I have already posted the presentation I gave at the summer school. In the following posts, I will report a bit on the lectures, similar to my posts about last year's TPLI summer school.
The first two days covered four different topics. Adam Welc started with an introduction to Software Transactional Memory, Jean-Pierre Talpin gave lectures on the topic Virtual Prototyping Embedded Architectures, Satnam Singh reported on his experience with Data-Parallel Programming, and last but not least, Peter Sewell introduced us to the mysteries of hardware memory models with his talk titled Low-Level Concurrency – For Real.
Adam Welc: Software Transactional Memory
Adam, who works for Intel Labs, focused in his first presentation mainly on the semantics of STM and the different approaches to designing an STM system. The second presentation concentrated on how such a system can be implemented.
His presentations were very much in the spirit of the STM system Intel provides as a prototype, but the general ideas are used in other systems as well.
The semantic options for an STM he discussed included open vs. closed transaction nesting, transaction flattening, and abort and retry. Furthermore, he discussed the interaction with, for instance, legacy code, which leads to the question of weak vs. strong atomicity, and the semantics of single-global-lock atomicity. The last semantic question he discussed was how consistency should be provided and how pointer privatization could be supported by the system. Afterwards, he outlined the design space for STM systems, introducing optimistic vs. pessimistic transactions as well as write buffering vs. in-place updates.
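To make one point in that design space concrete, here is a minimal sketch of an optimistic transaction with write buffering and commit-time validation. All names (`TVar`, `Transaction`, `atomic`) are hypothetical and single-threaded for clarity; a real STM adds locking, contention management, and much more.

```python
# Minimal sketch: optimistic concurrency + write buffering.
# Reads record the version they observed; writes go into a buffer;
# commit validates the read set and publishes the buffer.

class Conflict(Exception):
    """Raised when commit-time validation fails."""

class TVar:
    """A transactional variable with a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

class Transaction:
    def __init__(self):
        self.read_set = {}    # TVar -> version observed at first read
        self.write_buf = {}   # TVar -> buffered new value

    def read(self, tvar):
        if tvar in self.write_buf:           # read-your-own-writes
            return self.write_buf[tvar]
        self.read_set.setdefault(tvar, tvar.version)
        return tvar.value

    def write(self, tvar, value):            # buffered, not in-place
        self.write_buf[tvar] = value

    def commit(self):
        # Validate: everything we read must be unchanged since we read it.
        for tvar, version in self.read_set.items():
            if tvar.version != version:
                raise Conflict
        # Publish the write buffer (atomic only in this toy setting).
        for tvar, value in self.write_buf.items():
            tvar.value = value
            tvar.version += 1

def atomic(body):
    """Re-run body until it commits without conflict (abort + retry)."""
    while True:
        tx = Transaction()
        result = body(tx)
        try:
            tx.commit()
            return result
        except Conflict:
            continue  # abort: discard the buffer and retry
```

With in-place updates instead of write buffering, the write would hit `tvar.value` directly and an abort would have to undo it from a log, which is exactly the trade-off the design space maps out.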
In the second talk, he presented the implementation details.
For their C++-based STM, the compiler needs additional information about the characteristics of all code, which is provided by annotations at the class and method level. Similarly, the compiler supports atomic blocks as well as abort and retry statements. Besides the implementation details, he also discussed several optimizations the compiler applies to reduce the STM overhead, eliding read/write barriers wherever possible. My gut feeling tells me someone should write an STM + tracing-JIT story. Not sure whether that buys anything, but what I have seen of STM looks a lot like trace guards.
The last slides he presented were about performance. The basic conclusion is still: performance is an issue. However, he gave me an interesting pointer to one of his papers, which basically shows that transactions can be used for locks without contention quite straightforwardly: Transparently reconciling transactions with locking for Java synchronization.
Jean-Pierre Talpin: Virtual Prototyping Embedded Architectures
Unfortunately, I missed the introductory part of these lectures due to problems with our accommodation *sigh*. However, as far as I understood, in the domain of embedded systems for large-scale applications like aircraft, there is always a need to combine a large number of different technologies, and the final goal is to simulate, analyze, and verify the software before it is deployed to production. The verification part is especially important for critical systems such as those in avionics.
The systems in this domain are mostly event-driven, i.e., in his terminology, an asynchronous composition of reactive processes.
For this purpose, they developed Polychrony, including an Eclipse integration. It allows using data-flow models for computation and mode automata for control. These models are verified with model-checking techniques and allow controller synthesis.
At the heart of this system is a data-flow language that allows expressing such an asynchronous event system in terms of synchronous modules, which have better characteristics with respect to verification. Furthermore, they established the means to prove certain correctness criteria.
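To give a feel for the synchronous style, here is a toy sketch: signals are streams of values sampled on a common logical clock, and processes are functions applied instant by instant. This is only the flavor of the idea in hypothetical Python, not SIGNAL/Polychrony syntax, whose actual model (polychronous clocks, clock calculus) is far richer.

```python
# Toy synchronous data-flow sketch: a signal is a finite stream of
# values on a shared logical clock; operators act instant by instant.

def sig_add(xs, ys):
    """Pointwise addition of two signals on the same clock."""
    assert len(xs) == len(ys), "signals must share a clock"
    return [x + y for x, y in zip(xs, ys)]

def pre(init, xs):
    """Unit delay: output at instant n is the input at instant n-1."""
    return [init] + xs[:-1]

def integrate(xs):
    """Running sum via a feedback loop: y(n) = y(n-1) + x(n)."""
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out
```

Because every operator consumes and produces values at known clock instants, properties like boundedness and determinism become statically checkable, which is what makes the synchronous view attractive for verification.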
For an aerospace project, they used this as the basis of a code-generation framework. One of their techniques uses SSA as an intermediate representation to model existing software; they translate the SSA form into their synchronous data-flow formalism, enabling their analyses.
Satnam Singh: Data-Parallel Programming
The motivation behind his work is to allow users to utilize parallelism in a civil manner, i.e., with a reasonable effort/gain ratio. His premise is that programming models like SIMD, OpenMP, and MPI are inherently low-level and require very detailed knowledge about the target architecture, which restricts their applicability. Not only does this increase development cost, it also ties the application to a single platform. With Microsoft's Accelerator, they are developing a high-level data-parallel library that targets various platforms, starting with standard multi-core CPUs with SIMD instruction extensions, GPGPUs, and FPGAs.
The user describes the computation by defining a data-flow graph, which expresses the intention instead of a detailed mapping to an execution. Most notably, they provide parallel array structures on which operations can be performed without necessarily using an index, deferring these details to the runtime, which can optimize the relevant array operations for the target architecture. Furthermore, the library provides certain standard mechanisms for working in a data-parallel way. This includes various operations on the parallel arrays, which can, for instance, be combined with appropriate reduce operations. An important operation he emphasized is the array shift operation, which is used extensively to describe necessary data transformations. Instead of doing this kind of transformation explicitly, the runtime system can optimize it heavily for the target platform.
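The index-free, deferred style can be sketched as follows. Accelerator itself is a .NET library; this hypothetical Python mock-up only mimics the idea that each operation builds a graph node and nothing is computed until the graph is evaluated (which is where the real library's JIT back end would kick in).

```python
# Sketch of deferred, index-free parallel-array operations in the
# Accelerator style (hypothetical API). Operations build a data-flow
# graph; evaluate() is a naive stand-in for the JIT back end.

class PA:
    """A node in the data-flow graph of parallel-array operations."""
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        return PA('add', self, other)

    def shift(self, k, fill=0):
        """Whole-array shift by k positions; no explicit indexing."""
        return PA('shift', self, k, fill)

def const(values):
    return PA('const', list(values))

def evaluate(node):
    """Interpret the graph (the real runtime compiles it instead)."""
    if node.op == 'const':
        return list(node.args[0])
    if node.op == 'add':
        a, b = evaluate(node.args[0]), evaluate(node.args[1])
        return [x + y for x, y in zip(a, b)]
    if node.op == 'shift':
        xs = evaluate(node.args[0])
        k, fill = node.args[1], node.args[2]
        if k >= 0:
            return [fill] * k + xs[:len(xs) - k]
        return xs[-k:] + [fill] * (-k)
    raise ValueError(node.op)

# Example: a 3-point moving sum with shifts, no index arithmetic.
x = const([1, 2, 3, 4, 5])
y = x + x.shift(1) + x.shift(-1)
```

Because `y` is just a graph, a runtime is free to fuse the shifts and the additions into one vectorized kernel for whatever target it compiles to.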
The resulting program description is just-in-time compiled for the target machine and only then executed. For the typical problems they approach, the construction of the graph describing the program and the JIT overhead are negligible.
However, this form of nested data parallelism also brings some problems. In particular, distributing the problem over the computing platform is hard, since the shape of the resulting computational graph depends on the input data.
Peter Sewell: Low-Level Concurrency – For Real
The basic conclusion of his talk is: it is impossible to write portable, correct, and efficient concurrent code for any existing hardware platform, because the vendors do not actually tell you all the constraints/properties in their specifications.
So he basically starts to rant about all the processor manufacturers and their specs. He begins with a seemingly simple example and demonstrates that the expected outcome is not guaranteed.
Then he goes on to describe various examples for x86 and relates them to the tricks a CPU applies, which result in rather weak memory models.
Based on that, he presents a model they developed called x86-TSO (Total Store Order), which makes it possible to reason about the correctness of concurrent x86 programs. This model is eventually used to introduce the notion of triangular races, which enables reasoning about a class of programs that may contain data races, for instance lock implementations.
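The classic illustration here is the store-buffer litmus test: thread 0 runs `x = 1; r0 = y` while thread 1 runs `y = 1; r1 = x`. Under sequential consistency, `r0 = r1 = 0` is impossible, but TSO-style store buffers allow it. A toy sketch (my own simplification of the x86-TSO machine model, with drain points made explicit) shows why:

```python
# Sketch of the store-buffer (SB) litmus test under a TSO-like model:
# each thread's store enters a private store buffer first and drains
# to shared memory later; loads check the own buffer, then memory.

def run_sb(drain0_early, drain1_early):
    mem = {'x': 0, 'y': 0}
    buf = [{}, {}]                 # one store buffer per thread

    def store(t, var, val):
        buf[t][var] = val          # buffered, not yet globally visible

    def load(t, var):
        return buf[t].get(var, mem[var])   # forward from own buffer

    def drain(t):
        mem.update(buf[t])
        buf[t].clear()

    # Thread 0: x = 1; r0 = y      Thread 1: y = 1; r1 = x
    store(0, 'x', 1)
    if drain0_early:
        drain(0)
    store(1, 'y', 1)
    if drain1_early:
        drain(1)
    r0 = load(0, 'y')
    r1 = load(1, 'x')
    drain(0)
    drain(1)
    return r0, r1
```

If both buffers drain only after the loads, both threads read the stale 0 from memory even though both stores have already "happened" program-order-wise, which is exactly the outcome sequential consistency forbids and real x86 hardware exhibits.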
From there, he goes on to rant about all the other existing hardware architectures and their semantics, or rather their lack of specification. They tried for years to come up with a formal model like x86-TSO for these, but failed.
To conclude his lectures, he presents ten different options for defining a sane memory model, for instance for a programming language. He starts with option 1: DON'T, i.e., no concurrency at all. Well, my personal impression is still that this is his favorite option; it has several advantages, especially its simplicity. In the end, he arrives at option 10, which means having a memory model similar to that of C++0x: a data-race-free model that provides low-level concurrency primitives and allows breaking abstractions. Apart from the data-race-free part, that is something I had expected/thought about earlier for a good VM.