Notice: Undefined offset: 7 in /project/eecs/parlab/www/view/blog/index.php on line 17 View

Speculative Multithreading and Manycore

Posted by Bryan Catanzaro on February 2nd, 2007

Andy Glew has a lot of interesting things to say in response to Dave's presentation at Stanford this week:

I enjoyed watching your Stanford EE380 talk today. I actually agree
with most of what you said, especially the part about how we need to
get past MultiCore, 2/4/8/16/32 processing elements, and get to 100s
and more cores.

But you made an off-hand remark, something like: "If you know how to
get a modern processor to 10GHz with reasonable IPC and power, I want
to talk to you...". So I'm talking to you, not expecting anything to
come out of it, but just because... well, I want you to know that not
everyone thinks that single thread performance cannot be improved.
Although it turns out that I want to improve single thread performance
by speculative multithreading, and the actual chip looks very much
like the MultiCore chips the industry is moving towards, or the
ManyCore chip that you, Patterson, espouse.

---++ Disclaimer

This is personal opinion only. It is not the official position of any
of my present or past employers - e.g. neither Intel nor AMD. I do not
necessarily know what the official positions are, if the exist. I am
pretty sure that I have not divulged any Intel or AMD secrets.

---+ SpMT

I'll divide it into several issues:
* CPU microarchitecture
* parallel substrate architecture
* frequency

---++ CPU microarchitecture

If you look at Haitham Akkary's DMT, Dynamic Multithreading, work, or
what I generically call the SpMT research area - SpMT = speculative
multithreading - I think that you will see some potential for speeding
up single thread applications. But you do it by taking those single
threaded applications or codes or whatever, and dividing them into
multiple speculative threads. I personally am most comfortable with
the hardware approaches - fairly simple predictors, such as predicting
that certain function calls are independent possibly with - and
recording the speculative execution into "logs" that live in main
memory (or cached portions thereof). Re-executing from these logs
when threads join. It is straightforward to make the process of
executing from these logs fast - so long as you have speculated
correctly, the ILP is infinite. But, of course, you have to execute
well when you have not speculated correctly - in my experience, when
you have the right control path, but up to 10% of the instructions
have incorrect data values, and, more importantly, when your control
paths diverge (and then later reconverge). Recording to and from
these log buffers takes bandwidth, but is latency insensitive.

(Note: what I call SpMT should not be confused with Antonio Gonzales'
SpMT. I tend to use SpMT as a generic term; there are several
researchers working in the area with specific proposals. Including
me, from time to time, although never for more than a few months at a
time.)

---++ Parallel substrate

In this, I don't think we are so far apart. I see SpMT running
probably first on MultiCore, because that is what companies are
building. I'm not sure that hardware multithreading needs to be
combined with multiple cores, but if you have a multithreaded core you
can do it, and hardware multithreading makes it easier to do SpMT: you
start by forking a thread on a local processor, using, e.g. register
file cloning or (my specialty) renamer tricks, and then you migrate
the thread to a different core to run. I.e. hardware threading makes
forking cheaper and easier, and reduces the need to be so accurate as
to when to fork.

I doubt that I can take a single threaded integer application such as
GCC and run on 1000 processors - but running on 8 has proven
reasonable. I have encountered embarassingly parallel single threaded
applications that can use 1000s of threads/cores via SpMT - but they
are also easy to parallelize.

Overall I see SpMT mainly as a transition / coexistence mechanism: it
allows us to get benefits from MultiCore and ManyCore for some
existing applications, while waiting for your parallel computing
revolution to happen. And, as parallel stuff becomes more common,
SpMT and parallel stuff coexist. If you have enough parallel
applications to fill up all 1000 (or 32) cores, great; if not, and you
have a single threaded application that you want to speed up, SpMT
delivers a modest but significant performance boost.

How much performance? Overall, it seems to be the usual square-root
law: performance grows as the square root of the number of
transistors. For SpMT, performance grows as the square root of the
number of cores. It's not great, but it's better than nothing

What's the power cost? The substrate in which you execute SpMT is
pretty much the same as any CPU core. SpMT likes multithreading, as
mentioned above. And it needs this comparatively small wart on the
side, the "log". There doesn't need to be any snooping hardware: you
can verify speculation and commit results by executing (and verifying)
out of the log at maximum ILP. So, SpMT has a slight tendency to
prefer higher ILP machines, like, maybe, 4 integer ALUs with 2 load
ports. But nothing much beyond 4 makes sense, maybe 8 at a stretch.
Heck: on modern machines we are lucky to get 1 IPC: with SpMT we can
regularly get max ILP, say, 4, during replay.

The SpMT log is basically a device that records program execution. It
doesn't have to record everything, just enough to speed up execution
during replay/verify/reexecution. And, at least some parallel
programming people have been asking for stuff like this anyway to make
debugging easier.

Sure, you can add snooping hardware to speed up detection of
misspeculations. But snoopers are expensive and cost power. So it's
a tradeoff.

Performance wise, SpMT seems to provide the same benefit that pretty
much every CPU microarchitecture technique does: performance varying
as the square root of the number of transistors. Not so nice as
linear, but if you still believe in Moore's Law scaling, well, the
square root of an exponential is still an exponential. And it appears
that the square root law also applies to parallel programming.

---++ Frequency

I don't know that I really care about frequency - I'm quite happy to
ride the number of transistors curve. I.e. I am quite happy to use
lots of slow transitors, and just use the number of transistors,
arranged suitably, to speed things up.

But, your challenge was to run at a reasonable IPC at 10GHz, so I'll
just briefly say:

Some recent CPU microarchitectures tried to get frequency at any
cost. This approach was pushed by people who believe YOUR conventional
wisdom, that there is nothing you can do in CPU microarchitecture to
improve ILP. I never agreed with this attitude. I have always tried
to work on speeding up single threaded programs via microarchitecture,
not just by frequency. I have had to adjust my microarchitectures as
the frequency target changes, but my goal has always been the same:
speed up single threaded applications using parallelism.

(I have no problem with explicit parallelism. I like working on
it. But I also do not want to reject implicit parallelism, when it is
there to be found.)

(By the way, "speeding up single threaded applications using
parallelism"
is not necessarily the same as increasing ILP.
E.g. if frequency is increasing and the number of gates per clock
cycle is decreasing, then advanced microarchitectures may be necessary
to stay even - to stay at 1 IPC. Similarly if memory delays are
increasing.)

But there were some good ideas in these high frequency designs, as
well as bad. The good ideas involved taking unnecessary stuff out of
the critical loops of the CPU - e.g. did you know that on all integer
bypass paths on P6 had an 8 bit shifter to cope with the AH
registers? It's almost never used...

Increasing frequency by rid of unnecessary stuff is good. (Or may be
good; there are other reasons, such as tolerating on-die variation, to
go for slower clock frequencies.)

It is unfortunate that some of these good ideas were lost along with
the bad ideas. For example, "replay" pipelines now have a really bad
reputation. But there are ways to implement replay that do not have
the problems that some recent, pre-right-hand-turn, machines do or did.

The recently turned away from high frequencies had the same problem:
they were not really doing anything new, CPU microarchitecture wise,
except trying to enable 2-4X higher frequencies. They were still
basically first generation out-of-order machines at heart, and in many
ways were inferior OOO machinjes to their predecessors (and
successors) the P6 family and and AMD K7/K8.

There's maybe 2X frequency to be had by cleaning unnecessary stuff out
of the pipeline, keeping transistors the same speed. But creating a
microarchitecture just so that you can push circuit speed using
expensive power hungry techniques is borderline.

But I don't think that it is necessary to create a new
microarchitecture from scrach to clean up these stupidities. They can
be cleaned up incrementally, from version to version of the same CPU
microarchitecture.

Bottom line: yes, there's frequency to be gained. But not a
revolution. A cleanup.

---++ If this is such a good idea...

Then why isn't anybody doing it?

Well, I tried, and still want to.

Having witnessed project cancellations (not just my project, other
projects) at both Intel and AMD, I take the moral: nobody is going to
pay you to do a new out-of-order CPU core. It's too risky, and why
would a new OOO core be different or better than the old, polished up.
You have to do something different to justify designing a new CPU. Or,
better yet, take existing CPU cores, and do something different with
them.

I think to most people SpMT looks like both too much, and too little.
It doesn't really involve designing a CPU core from scratch - it
mainly takes existing CPU cores, and makes a few tweaks to support
SpMT. So you cannot justify a 1000 engineer team to work on it. In
that way, it is too little. And SpMT is too much, because it needs
multiple cores to be interesting. SpMT on a dual core is not very
interesting; on 4 cores, it starts getting interesting.

Anyway, enough on SpMT. I just wanted to tell you that there is at
least somebody in industry who hasn't given up on improving single
thread performance. But the things SpMT needs are almost exactly the
same things as ManyCore needs; it's a balanced wager. If you have
parallel applications, you run explicitly parallel.
If you have single threaded applications, you can speed them up too.




---+ Other stuff

---++ Number of Cores

I couldn't agree with you more about jumping to many cores. My dicta
is: We have both too much parallelism and too little parallelism. We
have too little parallelism to take advantage of a 16 CPU MultiCore.
But there are a whole slew of apps that can be written as 100s of
threads, and run on ManyCore.

In fact, I think that it is often easier for the programmer to write
the application as 1000s of threads. But the threads are irregular,
unbalanaced, etc. One of the most important things for parallel
programming systems will be to coalesce the 1000s of programmer
specified threads into a workload that can run efficiently on the
Intel/AMD provided 8-16 CPU MultiCore chips.

---++ Hardware User Level Threads

So, let me pitch an evolutionary step: basically Burton Smith style
multithreading.

Instruction set extensions that allow the programmer to specify
thousands of threads. And a hardware scheduler which timeslices them
onto a MultiCore - or a ManyCore with fewer cores than the user has
specified. Switch on event, switch on cache miss, etc.

A consideration which is irrelevant to academics, but matters to
industry: use this hardware scheduler behind the back of the OS.
Microsoft isn't going to ship a 100 CPU scheduler for consumer devices
next year. But if the consumer OS thinks there are only 4 or 8 CPUs,
we can let the hardware scheduler (a) use as many CPU cores as are
actually available, and (b) timelsice the still larger number of user
threads onto the hardware cores.

This way we can get programmers starting to use 100s and 1000s of
threads before the actual hardware is available. Once the highly
threaded software is available, then it is a question of "How many
hardware threads / cores run this highly threaded application best?"

---++ Google

You said it: Google is in the right position to take advantage of
ManyCore.

I see the computer industry as a 3 way competition:

The PC guys: Intel/AMD/Microsoft

versus the GPU guys Nvidia (and maybe AMD/ATI)

versus Google type centralized services

2 of these corners may gang up on the third. Or maybe everyone will
work together

With cell phones and Cisco sitting out there somewhere, waiting to
pile on.

White Paper Posted

Posted by Bryan Catanzaro on December 18th, 2006

The long awaited View white paper has been posted as a tech report!

Azul Vega2

Posted by Bryan Catanzaro on December 4th, 2006

Azul Systems announced their second generation machine today, which is designed for highly threaded Java workloads. Each of their Vega2 chips has 48 simple cores, arranged in 6 clusters of 8 processors each, connected around a central crossbar. Each chip has 4 on chip memory controllers as well as a dedicated SERDES link to every other chip in the system, to support a flat memory space.
They are claiming a world record score on SPECjbb2005 for systems running a single Java VM. In context, their score is about 6% better than a 128 chip/128 core Fujitsu Sparc64 machine running a single Java VM, only they're using just 4 chips (albeit with 192 cores).

References:

SiCortex Unveils

Posted by Bryan Catanzaro on November 10th, 2006

SiCortex went public this week with its custom MIPS HPC cluster-in-a-cabinet. I couldn't help but smile when I read this line from one of their white papers: "Contrary to conventional wisdom, high performance computers must dramatically reduce their power consumption in order to achieve their performance goals." Sound familiar? =)
SiCortex views parallel computing as primarily a data movement problem, so they chose simple, in order MIPS cores to reduce power consumption, allowing them to pack the processors very densely. They fit 6 MIPS cores & their caches, memory controllers, PCI Express controllers, and Fabric ports on each chip, then tile 27 of these chips along with their DDR2 memories onto each blade. Their high end machine has 36 blades, for a total of almost 6000 cores, consuming only 18 kW. Their low power consumption allows them to pack the chips very densely physically, which along with an interesting network topology, allows interprocessor communications to be very economical - at max only 6 hops, whereas a 3D torus interconnect would require 15.
SiCortex envisions MPI as the programming model of choice.
This product is some additional anecdotal evidence in support of our position that small, simple processors are the way to go.

Some links:

Nvidia GeForce 8800

Posted by Bryan Catanzaro on November 8th, 2006

Today is a big day in the GPU world, since Nvidia's releasing a brand new architecture! This one's sometimes referred to as G80, which will be productized into various shapes and sizes of GPU in the Geforce 8000 series.
This architecture is noteworthy in that it is the first GPU for the desktop with unified shaders (although the ATi GPU in the Xbox 360 had them a year ago). In other words, its processing elements are more general purpose and programmable than previous architectures, which had assemblages of more specialized units. The Geforce 8800, released today, has 128 of these shaders running at around 1.35 GHz. Nvidia is calling them "stream processors" in order to emphasize their more general purpose nature. Interestingly, they are scalar processors, which Nvidia claims offers more flexibility than the traditional vector processors used in GPUs.
The 128 stream processors are organized in hierarchical fashion: groups of 16 share an "L1" cache, and then 8 groups of shaders are connected via a crossbar to 6 rasterization groups, each of which contains an "L2" cache and a 64-bit channel to GPU memory.
Nvidia is touting the general purpose nature of this architecture, and claiming large advantages on many types of code over CPUs - like 10x on rigid body physics simulation to 200x on financial calculations. There are 2 separate drivers: one for games, and a new one for exposing the architecture to general purpose programming. The programming model looks to be a variant of C with threads.
This GPU is the first to support Microsoft's DX10 standard, which brings two important capabilities to the GPU: Virtual Memory, meaning that the graphics card's memory can now be virtualized by the OS, and Multitasking - in contrast to previous GPUs which assumed only one process was using the GPU at a given time. Also, a "higher" degree of compliance with 32-bit IEEE floating-point standards is claimed, although specifics weren't easy to find.
Benchmarks are impressive, as befits such a monstrous chip with almost 700M transistors. =)
This release continues the trend of GPUs becoming more programmable and intruding on traditional CPU territory. The stage continues to set up for a battle of titans over PC silicon between Intel, AMD/ATi and Nvidia.

Some good reviews: