Notice: Undefined index: bBcomment in /project/eecs/parlab/www/view/blog/index.php on line 5 Using Sequential Programming Models to Program Manycore Systems - View

Using Sequential Programming Models to Program Manycore Systems

Posted by John Shalf on March 1, 2007

Wen-mei Hwu from University of Illinois presented an Keynote Talk at this year's IEEE MICRO-39 conference. He takes the the position that sequential programming models are the best approach to programming multicore systems.

The top 5 reasons listed in the presentation are:

  1. Developing a non-trivial explicitly parallel application is an expensive proposition.

  2. Verification and support of a hand-parallelized program is even more expensive.

  3. Programs developed in an explicitly parallel form are unlikely to scale with Moore's Law.

  4. Tools must have in-depth knowledge of the apps regardless of programming models being used anyways.

  5. Evolution of tools takes place at much faster pace than that of human race.

I'll make the following observations:

Wen-Mei's arguments are very appealing because such a path would enable us to side-step the spectre of a massive software re-engineering to conform to a new, and as-of-yet undecided programming model. This is particularly appealing for engineering applications that would be subject to an onerous recertification process if recoded to support massive parallelism.

Parallel Algorithms vs. Parallel Programming

While I agree with Hwu that the past 30 years of computing have yielded marginal advances in programming models or tools -- the advances are primarily found in the huge body of parallel algorithms that have been developed during this time. David Keyes observes in the SCaLeS report ( ) that advances in parallel algorithm research have improved code performance faster than Moore's law-driven improvements in hardware performance. Hwu's proposal does not appear to build on the parallel algorithms that the math and CS community have developed thus far. Going backwards to serial expression of algorithms is not a constructive approach that builds upon the existing body of accumulated knowledge.

Building on Existing Body of Work in Parallel Algorithms

I also agree with Hwu that the existing tools for expressing parallelism are unlikely to be adopted by the larger software development community due to their complexity. However, I think the explicitly parallel algorithms that have been developed using these older programming models are very relevant for future systems. The real problem, I would argue, is that we need a more elegant way of expressing explicitly parallel algorithms rather than going backward to serial algorithms and attempting to automatically infer a parallel implementation (synthesize them from first principles). This isn't to say that further advances in autoparallelization are useless, but it should not be considered as an alternative to leveraging the huge body of work developed by the applied math community. It should be just one of the tools in toolbox (not *the* tool that supplants all others).

We think these parallel algorithms will have broad impact across all problem domains (not just HPC). The primary differences between the algorithm requirements for embedded and HPC applications revolves around the data types (integer vs. double precision floating point) rather than a need for fundamentally different algorithms.

Isomorphic Transformations from Serial to Parallel Algorithm Rarely Exist

If there is anything we can learn from the last decade of parallel algorithm research, it is that the transformation from serial to parallel algorithms is not isomorphic. For example, I cannot imagine a set of code transformations that would convert a conventional N-body algorithm into fast-multipole (a method that offers much higher parallel efficiency) except to recognize the entire n-body algorithm and replace it with the parallel implementation. Such an approach would be impractical. Past experience with serial algorithms has shown that isomorphic transformations to parallel implementations tend to offer poorer efficiency compared to adopting an entirely new algorithm developed explicitly for parallel machines.

Expressing Parallel Algorithms in Serial Language is Difficult

As an example, please consider recent work by Frey and Dueck on an explicitly parallel algorithm that improves image recognition and classification performance. ( ). Such algorithms are central to Intel's RMS (Recognition Mining and Synthesis) applications that are expected to drive future performance improvements for desktop processors. Expressing such an algorithm in an explicitly sequential language would be far more difficult than using an explicitly parallel language. Many of the key algorithms that have enabled dramatic performance advances over the past decade would be extremely obtuse to express using an explicitly serial language.

Program implementations should be independent of the number of processors

With respect to Hwu's overall goal of making parallelism more implicit via directives, I don't see that as being very different from what we say in the Berkeley View: "A good programming model should be independent of number of processors." Once you consider all of the assertions and the directives that are under consideration in their effort, it starts to look a lot like the elements of the explicitly parallel languages that we advocate.

Guaranteeing Locality of Effect

Efficient parallelization requires strict boundaries on the scope of side-effects. The need for locality-of-effect has been examined in exquisite detail during the CS community's obsession with dataflow hardware back in the 1980s. Functional languages provide the most efficient approach to programming dataflow computing architectures because they provide strong semantics that enforce locality of effect, and therefore make safe, automatic parallelization tractable (iff you are willing to express your algorithm using a functional language).

I worry in general that serial languages do not provide the necessary semantic guarantees about locality of effect that is necessary for efficient parallelism. Ornamenting the language to insert the semantics of such guarantees (as we do with OpenMP scoping of arrays) is arduous, prone to error, and quite frankly not very intuitive. For instance, the last slide of Hwu's presentation describes the process of annotating user-defined memory allocation routines to provide hints to an auto-parallelizing compiler. I do not find the process Hwu describes to be the least bit intutivie, nor does it support Hwu's claim that such directives allow programmers to focus on algorithms. The same is true of many other semantic features that would require annotation. It seems that existing sequential languages underspecify the semantic guarantees that are required to ensure safe parallelization, and fixing those deficiencies after-the-fact with annotations seems more unfriendly to programmer productivity than other programming models that are currently being considered.

It is possible to leverage the syntax of existing languages, such as C, and add additional ornamentation to provide more specific guarantees of locality of effect that can be exploited by the compiler and runtime environment to facilitate parallelization. CILK is a good example of this approach. Using only a handful of keywords, one can ornament C functions or subroutines to guarantee locality of effect. The changes, in effect, turn those procedures into elements of a functional language. Functional languages provide the necessary guarantees regarding the scope of side-effects that enable the runtime environment to expose the necessary concurrency.

However, the programming style necessary to exploit CILK's stricter guarantees of locality-of-effect is fundamentally different than the way you would express a problem serially. Although it is true that a code that has been ornamented using CILK keywords can be run correctly on a serial processor (essentially ignoring the advice of these keywords), it is not true that applying the keywords to any arbitrary piece of serial code will result in a parallel speedup. One cannot achieve a speedup using the conventional serial algorithm -- they must adopt new algorithms.

So the method of expressing a problem is entirely different from the question of algorithms -- these are two different things. You really need explicitly parallel algorithms to achieve good parallel efficiency, even if you leverage elements of existing serial languages. Compilers offer an extremely limited ability to change algorithmic implementation since the transformation between the serial and parallel algorithm are not necessarily isomorphic. Ultimately, we cannot escape the pain points of implementing parallelism by adopting a serial language.

I'll finish by responding directly to the top-5 list of reasons to adopt a serial programming model.

Responses to Top5-list of reasons to use a serial programming model

  1. Developing a non-trivial explicitly parallel application is an expensive proposition Comment: What if the algorithm is parallel? Is the issue perhaps that we need a better environment to describe parallel algorithms than the current MPI+Fortran ecosystem? Its not a problem with parallelism, its a problem with the medium of expression that we currently employ for parallel algorithms.

  2. Verification and support of a hand-parallelized program is even more expensive.Comment: Even if you adopt a strictly serial programming model, verification of the facilities that transform a serial algorithm into parallel form will be equally daunting. In that sense, the sequential specification of the algorithm is not saving you from the verification step. Is the issue that we need to invest in tools that enable such verification regardless of how the problem is originally specified? Such technology is required by all, regardless of our views on the medium used for expressing the underlying algorithm.

  3. Programs developed in a explicitly parallel form are unlikely to scale with Moore's Law. Comment: Performance comes from development of better algorithms. In fact, the 15 years of development of parallel algorithms has actually exceeded growth rate of system performance. Why would we throw that away?

    David Patterson adds the following observation: My comment; compiler optimization is extraordinariliy unlikely to grow with Moore's Law. If it did, then we should have been able to make VLIW a success, as we could have uncovered parallelism that we could have turned into ILP. History has written a different ending to the VLIW story.

  4. Tools must have in-depth knowledge of the apps regardless of programming models being used anyways. Comment: which is why I am dubious that a compiler will do a better job of inferring an efficient parallel algorithm than a programmer who is provided an elegant explicitly parallel medium for expressing a parallel algorithm. The developer has a better knowledge of the domain than a compiler will.

  5. Evolution of tools takes place at much faster pace than that of human race. Comment: This is also true for the development of new algorithms by the math community. The math community does a better job of exploiting domain knowledge to design better parallel algorithms than compiler developers can.


RSS feed for comments on this post.

Leave a Comment

Sorry, Comments have been disabled for this post