Using Sequential Programming Models to Program Manycore Systems

Posted by John Shalf on March 1st, 2007

Wen-mei Hwu from University of Illinois presented an Keynote Talk at this year's IEEE MICRO-39 conference. He takes the the position that sequential programming models are the best approach to programming multicore systems.

The top 5 reasons listed in the presentation are:

  1. Developing a non-trivial explicitly parallel application is an expensive proposition.

  2. Verification and support of a hand-parallelized program is even more expensive.

  3. Programs developed in an explicitly parallel form are unlikely to scale with Moore's Law.

  4. Tools must have in-depth knowledge of the apps regardless of programming models being used anyways.

  5. Evolution of tools takes place at much faster pace than that of human race.

I'll make the following observations:

Wen-Mei's arguments are very appealing because such a path would enable us to side-step the spectre of a massive software re-engineering to conform to a new, and as-of-yet undecided programming model. This is particularly appealing for engineering applications that would be subject to an onerous recertification process if recoded to support massive parallelism.

Parallel Algorithms vs. Parallel Programming

While I agree with Hwu that the past 30 years of computing have yielded marginal advances in programming models or tools -- the advances are primarily found in the huge body of parallel algorithms that have been developed during this time. David Keyes observes in the SCaLeS report ( ) that advances in parallel algorithm research have improved code performance faster than Moore's law-driven improvements in hardware performance. Hwu's proposal does not appear to build on the parallel algorithms that the math and CS community have developed thus far. Going backwards to serial expression of algorithms is not a constructive approach that builds upon the existing body of accumulated knowledge.

Building on Existing Body of Work in Parallel Algorithms

I also agree with Hwu that the existing tools for expressing parallelism are unlikely to be adopted by the larger software development community due to their complexity. However, I think the explicitly parallel algorithms that have been developed using these older programming models are very relevant for future systems. The real problem, I would argue, is that we need a more elegant way of expressing explicitly parallel algorithms rather than going backward to serial algorithms and attempting to automatically infer a parallel implementation (synthesize them from first principles). This isn't to say that further advances in autoparallelization are useless, but it should not be considered as an alternative to leveraging the huge body of work developed by the applied math community. It should be just one of the tools in toolbox (not *the* tool that supplants all others).

We think these parallel algorithms will have broad impact across all problem domains (not just HPC). The primary differences between the algorithm requirements for embedded and HPC applications revolves around the data types (integer vs. double precision floating point) rather than a need for fundamentally different algorithms.

Isomorphic Transformations from Serial to Parallel Algorithm Rarely Exist

If there is anything we can learn from the last decade of parallel algorithm research, it is that the transformation from serial to parallel algorithms is not isomorphic. For example, I cannot imagine a set of code transformations that would convert a conventional N-body algorithm into fast-multipole (a method that offers much higher parallel efficiency) except to recognize the entire n-body algorithm and replace it with the parallel implementation. Such an approach would be impractical. Past experience with serial algorithms has shown that isomorphic transformations to parallel implementations tend to offer poorer efficiency compared to adopting an entirely new algorithm developed explicitly for parallel machines.

Expressing Parallel Algorithms in Serial Language is Difficult

As an example, please consider recent work by Frey and Dueck on an explicitly parallel algorithm that improves image recognition and classification performance. ( ). Such algorithms are central to Intel's RMS (Recognition Mining and Synthesis) applications that are expected to drive future performance improvements for desktop processors. Expressing such an algorithm in an explicitly sequential language would be far more difficult than using an explicitly parallel language. Many of the key algorithms that have enabled dramatic performance advances over the past decade would be extremely obtuse to express using an explicitly serial language.

Program implementations should be independent of the number of processors

With respect to Hwu's overall goal of making parallelism more implicit via directives, I don't see that as being very different from what we say in the Berkeley View: "A good programming model should be independent of number of processors." Once you consider all of the assertions and the directives that are under consideration in their effort, it starts to look a lot like the elements of the explicitly parallel languages that we advocate.

Guaranteeing Locality of Effect

Efficient parallelization requires strict boundaries on the scope of side-effects. The need for locality-of-effect has been examined in exquisite detail during the CS community's obsession with dataflow hardware back in the 1980s. Functional languages provide the most efficient approach to programming dataflow computing architectures because they provide strong semantics that enforce locality of effect, and therefore make safe, automatic parallelization tractable (iff you are willing to express your algorithm using a functional language).

I worry in general that serial languages do not provide the necessary semantic guarantees about locality of effect that is necessary for efficient parallelism. Ornamenting the language to insert the semantics of such guarantees (as we do with OpenMP scoping of arrays) is arduous, prone to error, and quite frankly not very intuitive. For instance, the last slide of Hwu's presentation describes the process of annotating user-defined memory allocation routines to provide hints to an auto-parallelizing compiler. I do not find the process Hwu describes to be the least bit intutivie, nor does it support Hwu's claim that such directives allow programmers to focus on algorithms. The same is true of many other semantic features that would require annotation. It seems that existing sequential languages underspecify the semantic guarantees that are required to ensure safe parallelization, and fixing those deficiencies after-the-fact with annotations seems more unfriendly to programmer productivity than other programming models that are currently being considered.

It is possible to leverage the syntax of existing languages, such as C, and add additional ornamentation to provide more specific guarantees of locality of effect that can be exploited by the compiler and runtime environment to facilitate parallelization. CILK is a good example of this approach. Using only a handful of keywords, one can ornament C functions or subroutines to guarantee locality of effect. The changes, in effect, turn those procedures into elements of a functional language. Functional languages provide the necessary guarantees regarding the scope of side-effects that enable the runtime environment to expose the necessary concurrency.

However, the programming style necessary to exploit CILK's stricter guarantees of locality-of-effect is fundamentally different than the way you would express a problem serially. Although it is true that a code that has been ornamented using CILK keywords can be run correctly on a serial processor (essentially ignoring the advice of these keywords), it is not true that applying the keywords to any arbitrary piece of serial code will result in a parallel speedup. One cannot achieve a speedup using the conventional serial algorithm -- they must adopt new algorithms.

So the method of expressing a problem is entirely different from the question of algorithms -- these are two different things. You really need explicitly parallel algorithms to achieve good parallel efficiency, even if you leverage elements of existing serial languages. Compilers offer an extremely limited ability to change algorithmic implementation since the transformation between the serial and parallel algorithm are not necessarily isomorphic. Ultimately, we cannot escape the pain points of implementing parallelism by adopting a serial language.

I'll finish by responding directly to the top-5 list of reasons to adopt a serial programming model.

Responses to Top5-list of reasons to use a serial programming model

  1. Developing a non-trivial explicitly parallel application is an expensive proposition Comment: What if the algorithm is parallel? Is the issue perhaps that we need a better environment to describe parallel algorithms than the current MPI+Fortran ecosystem? Its not a problem with parallelism, its a problem with the medium of expression that we currently employ for parallel algorithms.

  2. Verification and support of a hand-parallelized program is even more expensive.Comment: Even if you adopt a strictly serial programming model, verification of the facilities that transform a serial algorithm into parallel form will be equally daunting. In that sense, the sequential specification of the algorithm is not saving you from the verification step. Is the issue that we need to invest in tools that enable such verification regardless of how the problem is originally specified? Such technology is required by all, regardless of our views on the medium used for expressing the underlying algorithm.

  3. Programs developed in a explicitly parallel form are unlikely to scale with Moore's Law. Comment: Performance comes from development of better algorithms. In fact, the 15 years of development of parallel algorithms has actually exceeded growth rate of system performance. Why would we throw that away?

    David Patterson adds the following observation: My comment; compiler optimization is extraordinariliy unlikely to grow with Moore's Law. If it did, then we should have been able to make VLIW a success, as we could have uncovered parallelism that we could have turned into ILP. History has written a different ending to the VLIW story.

  4. Tools must have in-depth knowledge of the apps regardless of programming models being used anyways. Comment: which is why I am dubious that a compiler will do a better job of inferring an efficient parallel algorithm than a programmer who is provided an elegant explicitly parallel medium for expressing a parallel algorithm. The developer has a better knowledge of the domain than a compiler will.

  5. Evolution of tools takes place at much faster pace than that of human race. Comment: This is also true for the development of new algorithms by the math community. The math community does a better job of exploiting domain knowledge to design better parallel algorithms than compiler developers can.

Promise and Perils

Posted by Joseph Gebis on February 27th, 2007

CTWatch Quarterly, an online journal covering trends in high-performance computing technology, has dedicated its most recent issue to the
"Coming Multicore Revolution"
. It discusses issues and challenges of the multicore trend from a variety of hardware and software perspectives. The Berkeley View report is explicitly cited.

Reinventing Computing for the Multicore Inflection Point

Posted by Bryan Catanzaro on February 26th, 2007

Burton Smith (Microsoft) will be giving a talk on reinventing computing for multicore for the IEEE Technical Committee on Parallel Processing general membership meeting (March 6 in Long Beach, CA). Looks interesting - it'd be great to have a report from someone who can attend.

The many-core inflection point presents a new challenge for our industry, namely general-purpose parallel computing. Unless this challenge is met, the continued growth and importance of computing itself and of the businesses engaged in it are at risk. We must make parallel programming easier and more generally applicable than it is now, and build hardware and software that will execute arbitrary parallel programs on whatever scale of system the user has. The changes needed to accomplish this are significant and affect computer architecture, the entire software development tool chain, and the army of application developers that will rely on those tools to develop parallel applications. This talk will point out a few of the hard problems that face us and some prospects for addressing them.

NYTimes Article on Intel Teraflop project mentions View

Posted by Bryan Catanzaro on February 12th, 2007


Posted by Bryan Catanzaro on February 12th, 2007

The View project gets exposure on Slashdot.

Memory Instructions for Manycore

Posted by Bryan Catanzaro on February 2nd, 2007

Steven Ericsson-Zenith responds to Dave's presentation at Stanford:

I enjoyed your lecture yesterday (I am one of the transputer/occam architects and spoke to you at the end).

I have quickly reviewed the RAMP materials and find them interesting. I have my attention on the theoretical foundations of logic, recognition and motility currently but I would be interested in participating at some level.

I can offer a solution for you to consider in the memory/storage, synchronization, instruction set - that arose from my consideration of the issues at INMOS and YALE (I worked with Gelernter too).

In my 1992 Ph.D. thesis I propose extending the typing system of parallel programming systems (consistent with your suggestion yesterday) and added four complementary primitive instructions :

put and get, read and write

These are aimed at resolving the copy penalty that will occur when you move data around in a parallel system. Simplistically, it is a way of encapsulating pointers and access synchronization. The proposal has the advantage of a formal basis.

You can find my thesis online at:

Speculative Multithreading and Manycore

Posted by Bryan Catanzaro on February 2nd, 2007

Andy Glew has a lot of interesting things to say in response to Dave's presentation at Stanford this week:

I enjoyed watching your Stanford EE380 talk today. I actually agree
with most of what you said, especially the part about how we need to
get past MultiCore, 2/4/8/16/32 processing elements, and get to 100s
and more cores.

But you made an off-hand remark, something like: "If you know how to
get a modern processor to 10GHz with reasonable IPC and power, I want
to talk to you...". So I'm talking to you, not expecting anything to
come out of it, but just because... well, I want you to know that not
everyone thinks that single thread performance cannot be improved.
Although it turns out that I want to improve single thread performance
by speculative multithreading, and the actual chip looks very much
like the MultiCore chips the industry is moving towards, or the
ManyCore chip that you, Patterson, espouse.

---++ Disclaimer

This is personal opinion only. It is not the official position of any
of my present or past employers - e.g. neither Intel nor AMD. I do not
necessarily know what the official positions are, if the exist. I am
pretty sure that I have not divulged any Intel or AMD secrets.

---+ SpMT

I'll divide it into several issues:
* CPU microarchitecture
* parallel substrate architecture
* frequency

---++ CPU microarchitecture

If you look at Haitham Akkary's DMT, Dynamic Multithreading, work, or
what I generically call the SpMT research area - SpMT = speculative
multithreading - I think that you will see some potential for speeding
up single thread applications. But you do it by taking those single
threaded applications or codes or whatever, and dividing them into
multiple speculative threads. I personally am most comfortable with
the hardware approaches - fairly simple predictors, such as predicting
that certain function calls are independent possibly with - and
recording the speculative execution into "logs" that live in main
memory (or cached portions thereof). Re-executing from these logs
when threads join. It is straightforward to make the process of
executing from these logs fast - so long as you have speculated
correctly, the ILP is infinite. But, of course, you have to execute
well when you have not speculated correctly - in my experience, when
you have the right control path, but up to 10% of the instructions
have incorrect data values, and, more importantly, when your control
paths diverge (and then later reconverge). Recording to and from
these log buffers takes bandwidth, but is latency insensitive.

(Note: what I call SpMT should not be confused with Antonio Gonzales'
SpMT. I tend to use SpMT as a generic term; there are several
researchers working in the area with specific proposals. Including
me, from time to time, although never for more than a few months at a

---++ Parallel substrate

In this, I don't think we are so far apart. I see SpMT running
probably first on MultiCore, because that is what companies are
building. I'm not sure that hardware multithreading needs to be
combined with multiple cores, but if you have a multithreaded core you
can do it, and hardware multithreading makes it easier to do SpMT: you
start by forking a thread on a local processor, using, e.g. register
file cloning or (my specialty) renamer tricks, and then you migrate
the thread to a different core to run. I.e. hardware threading makes
forking cheaper and easier, and reduces the need to be so accurate as
to when to fork.

I doubt that I can take a single threaded integer application such as
GCC and run on 1000 processors - but running on 8 has proven
reasonable. I have encountered embarassingly parallel single threaded
applications that can use 1000s of threads/cores via SpMT - but they
are also easy to parallelize.

Overall I see SpMT mainly as a transition / coexistence mechanism: it
allows us to get benefits from MultiCore and ManyCore for some
existing applications, while waiting for your parallel computing
revolution to happen. And, as parallel stuff becomes more common,
SpMT and parallel stuff coexist. If you have enough parallel
applications to fill up all 1000 (or 32) cores, great; if not, and you
have a single threaded application that you want to speed up, SpMT
delivers a modest but significant performance boost.

How much performance? Overall, it seems to be the usual square-root
law: performance grows as the square root of the number of
transistors. For SpMT, performance grows as the square root of the
number of cores. It's not great, but it's better than nothing

What's the power cost? The substrate in which you execute SpMT is
pretty much the same as any CPU core. SpMT likes multithreading, as
mentioned above. And it needs this comparatively small wart on the
side, the "log". There doesn't need to be any snooping hardware: you
can verify speculation and commit results by executing (and verifying)
out of the log at maximum ILP. So, SpMT has a slight tendency to
prefer higher ILP machines, like, maybe, 4 integer ALUs with 2 load
ports. But nothing much beyond 4 makes sense, maybe 8 at a stretch.
Heck: on modern machines we are lucky to get 1 IPC: with SpMT we can
regularly get max ILP, say, 4, during replay.

The SpMT log is basically a device that records program execution. It
doesn't have to record everything, just enough to speed up execution
during replay/verify/reexecution. And, at least some parallel
programming people have been asking for stuff like this anyway to make
debugging easier.

Sure, you can add snooping hardware to speed up detection of
misspeculations. But snoopers are expensive and cost power. So it's
a tradeoff.

Performance wise, SpMT seems to provide the same benefit that pretty
much every CPU microarchitecture technique does: performance varying
as the square root of the number of transistors. Not so nice as
linear, but if you still believe in Moore's Law scaling, well, the
square root of an exponential is still an exponential. And it appears
that the square root law also applies to parallel programming.

---++ Frequency

I don't know that I really care about frequency - I'm quite happy to
ride the number of transistors curve. I.e. I am quite happy to use
lots of slow transitors, and just use the number of transistors,
arranged suitably, to speed things up.

But, your challenge was to run at a reasonable IPC at 10GHz, so I'll
just briefly say:

Some recent CPU microarchitectures tried to get frequency at any
cost. This approach was pushed by people who believe YOUR conventional
wisdom, that there is nothing you can do in CPU microarchitecture to
improve ILP. I never agreed with this attitude. I have always tried
to work on speeding up single threaded programs via microarchitecture,
not just by frequency. I have had to adjust my microarchitectures as
the frequency target changes, but my goal has always been the same:
speed up single threaded applications using parallelism.

(I have no problem with explicit parallelism. I like working on
it. But I also do not want to reject implicit parallelism, when it is
there to be found.)

(By the way, "speeding up single threaded applications using
is not necessarily the same as increasing ILP.
E.g. if frequency is increasing and the number of gates per clock
cycle is decreasing, then advanced microarchitectures may be necessary
to stay even - to stay at 1 IPC. Similarly if memory delays are

But there were some good ideas in these high frequency designs, as
well as bad. The good ideas involved taking unnecessary stuff out of
the critical loops of the CPU - e.g. did you know that on all integer
bypass paths on P6 had an 8 bit shifter to cope with the AH
registers? It's almost never used...

Increasing frequency by rid of unnecessary stuff is good. (Or may be
good; there are other reasons, such as tolerating on-die variation, to
go for slower clock frequencies.)

It is unfortunate that some of these good ideas were lost along with
the bad ideas. For example, "replay" pipelines now have a really bad
reputation. But there are ways to implement replay that do not have
the problems that some recent, pre-right-hand-turn, machines do or did.

The recently turned away from high frequencies had the same problem:
they were not really doing anything new, CPU microarchitecture wise,
except trying to enable 2-4X higher frequencies. They were still
basically first generation out-of-order machines at heart, and in many
ways were inferior OOO machinjes to their predecessors (and
successors) the P6 family and and AMD K7/K8.

There's maybe 2X frequency to be had by cleaning unnecessary stuff out
of the pipeline, keeping transistors the same speed. But creating a
microarchitecture just so that you can push circuit speed using
expensive power hungry techniques is borderline.

But I don't think that it is necessary to create a new
microarchitecture from scrach to clean up these stupidities. They can
be cleaned up incrementally, from version to version of the same CPU

Bottom line: yes, there's frequency to be gained. But not a
revolution. A cleanup.

---++ If this is such a good idea...

Then why isn't anybody doing it?

Well, I tried, and still want to.

Having witnessed project cancellations (not just my project, other
projects) at both Intel and AMD, I take the moral: nobody is going to
pay you to do a new out-of-order CPU core. It's too risky, and why
would a new OOO core be different or better than the old, polished up.
You have to do something different to justify designing a new CPU. Or,
better yet, take existing CPU cores, and do something different with

I think to most people SpMT looks like both too much, and too little.
It doesn't really involve designing a CPU core from scratch - it
mainly takes existing CPU cores, and makes a few tweaks to support
SpMT. So you cannot justify a 1000 engineer team to work on it. In
that way, it is too little. And SpMT is too much, because it needs
multiple cores to be interesting. SpMT on a dual core is not very
interesting; on 4 cores, it starts getting interesting.

Anyway, enough on SpMT. I just wanted to tell you that there is at
least somebody in industry who hasn't given up on improving single
thread performance. But the things SpMT needs are almost exactly the
same things as ManyCore needs; it's a balanced wager. If you have
parallel applications, you run explicitly parallel.
If you have single threaded applications, you can speed them up too.

---+ Other stuff

---++ Number of Cores

I couldn't agree with you more about jumping to many cores. My dicta
is: We have both too much parallelism and too little parallelism. We
have too little parallelism to take advantage of a 16 CPU MultiCore.
But there are a whole slew of apps that can be written as 100s of
threads, and run on ManyCore.

In fact, I think that it is often easier for the programmer to write
the application as 1000s of threads. But the threads are irregular,
unbalanaced, etc. One of the most important things for parallel
programming systems will be to coalesce the 1000s of programmer
specified threads into a workload that can run efficiently on the
Intel/AMD provided 8-16 CPU MultiCore chips.

---++ Hardware User Level Threads

So, let me pitch an evolutionary step: basically Burton Smith style

Instruction set extensions that allow the programmer to specify
thousands of threads. And a hardware scheduler which timeslices them
onto a MultiCore - or a ManyCore with fewer cores than the user has
specified. Switch on event, switch on cache miss, etc.

A consideration which is irrelevant to academics, but matters to
industry: use this hardware scheduler behind the back of the OS.
Microsoft isn't going to ship a 100 CPU scheduler for consumer devices
next year. But if the consumer OS thinks there are only 4 or 8 CPUs,
we can let the hardware scheduler (a) use as many CPU cores as are
actually available, and (b) timelsice the still larger number of user
threads onto the hardware cores.

This way we can get programmers starting to use 100s and 1000s of
threads before the actual hardware is available. Once the highly
threaded software is available, then it is a question of "How many
hardware threads / cores run this highly threaded application best?"

---++ Google

You said it: Google is in the right position to take advantage of

I see the computer industry as a 3 way competition:

The PC guys: Intel/AMD/Microsoft

versus the GPU guys Nvidia (and maybe AMD/ATI)

versus Google type centralized services

2 of these corners may gang up on the third. Or maybe everyone will
work together

With cell phones and Cisco sitting out there somewhere, waiting to
pile on.

Colloquium video posted

Posted by Bryan Catanzaro on January 31st, 2007

Video from the colloquium talk presented by Dave Patterson is now posted. It also includes a Q&A with Kurt Keutzer and Krste Asanovic.

Parallel programming project recognized as a Winner in IEEE Spectrum's Best and Worst Tech Projects for 2007

Posted by Jike Chong on January 26th, 2007

Rapidmind, a startup based in Waterloo, Ontario, Canada, was named a winner in the software category in January, 2007 issue of IEEE Spectrum. The company develops a platform that helps programmers realize the potential of parallel platforms such as GPUs and the Cell processor. The core technology is based on the programming language "Sh", which was developed at University of Waterloo, Ontario by Stefanus Du Toit and Michael McCool. The technology uses C++ as user interface, and asks user to re-write critical parallel sections in "Sh". It exploits data-level parallelism of the applications, and is able to map the same code base onto different platforms through dynamic compilation of higher level "sh" data constructs.

IEEE Spectrum January 2007 article
Rapidmind Whitepaper

White Paper Posted

Posted by Bryan Catanzaro on December 18th, 2006

The long awaited View white paper has been posted as a tech report!