I enjoyed watching your Stanford EE380 talk today. I actually agree
with most of what you said, especially the part about how we need to
get past MultiCore, 2/4/8/16/32 processing elements, and get to 100s
and more cores.
But you made an off-hand remark, something like: "If you know how to
get a modern processor to 10GHz with reasonable IPC and power, I want
to talk to you...". So I'm talking to you, not expecting anything to
come out of it, but just because... well, I want you to know that not
everyone thinks that single thread performance cannot be improved.
Although it turns out that I want to improve single thread performance
by speculative multithreading, and the actual chip looks very much
like the MultiCore chips the industry is moving towards, or the
ManyCore chip that you, Patterson, espouse.
This is personal opinion only. It is not the official position of any
of my present or past employers - e.g. neither Intel nor AMD. I do not
necessarily know what the official positions are, if they exist. I am
pretty sure that I have not divulged any Intel or AMD secrets.
I'll divide it into several issues:
* CPU microarchitecture
* parallel substrate architecture
---++ CPU microarchitecture
If you look at Haitham Akkary's DMT, Dynamic Multithreading, work, or
what I generically call the SpMT research area - SpMT = speculative
multithreading - I think that you will see some potential for speeding
up single thread applications. But you do it by taking those single
threaded applications or codes or whatever, and dividing them into
multiple speculative threads. I personally am most comfortable with
the hardware approaches - fairly simple predictors, such as predicting
that certain function calls are independent - and recording the
speculative execution into "logs" that live in main
memory (or cached portions thereof). Re-executing from these logs
when threads join. It is straightforward to make the process of
executing from these logs fast - so long as you have speculated
correctly, the ILP is infinite. But, of course, you have to execute
well when you have not speculated correctly - in my experience, when
you have the right control path, but up to 10% of the instructions
have incorrect data values, and, more importantly, when your control
paths diverge (and then later reconverge). Recording to and from
these log buffers takes bandwidth, but is latency insensitive.
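The fork/log/replay flow above can be sketched as a toy simulation. Everything here is hypothetical and illustrative - the "registers," log format, and verification step stand in for architectural state and are not any shipped or proposed design:

```python
# Toy model of speculative multithreading (SpMT) with a replay log.
# A speculative thread runs from a *predicted* entry state and records
# its results to a log; at the join, the prediction is checked against
# the actual state and the log is either committed or thrown away.

def run(instrs, state):
    """Execute instructions, recording (pc, result_state) to a log."""
    log = []
    for pc, instr in enumerate(instrs):
        state = instr(dict(state))
        log.append((pc, dict(state)))
    return state, log

def join(predicted_entry, actual_entry, log, instrs):
    """At a thread join, verify the speculation and commit or re-execute."""
    if predicted_entry == actual_entry:
        # Prediction was right: committing straight from the log is the
        # "infinite ILP" case - no real re-execution is needed.
        return log[-1][1], "committed from log"
    # Misspeculation: fall back to executing the instructions for real.
    state, _ = run(instrs, actual_entry)
    return state, "re-executed"

# The speculative thread guesses the state after some call returns.
spec_entry = {"r1": 10, "r2": 0}
instrs = [lambda s: {**s, "r2": s["r1"] * 2},
          lambda s: {**s, "r1": s["r1"] + s["r2"]}]
_, log = run(instrs, spec_entry)

final, how = join(spec_entry, {"r1": 10, "r2": 0}, log, instrs)
print(how, final)   # committed from log {'r1': 30, 'r2': 20}
```

Note the design point this captures: the log only has to be checked and committed, which is cheap and latency-insensitive, while a misspeculation degrades gracefully into ordinary execution.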
(Note: what I call SpMT should not be confused with Antonio Gonzales'
SpMT. I tend to use SpMT as a generic term; there are several
researchers working in the area with specific proposals. Including
me, from time to time, although never for more than a few months at a
time.)
---++ Parallel substrate
In this, I don't think we are so far apart. I see SpMT running
probably first on MultiCore, because that is what companies are
building. I'm not sure that hardware multithreading needs to be
combined with multiple cores, but if you have a multithreaded core you
can do it, and hardware multithreading makes it easier to do SpMT: you
start by forking a thread on a local processor, using, e.g. register
file cloning or (my specialty) renamer tricks, and then you migrate
the thread to a different core to run. I.e. hardware threading makes
forking cheaper and easier, and reduces the need to be so accurate as
to when to fork.
I doubt that I can take a single threaded integer application such as
GCC and run it on 1000 processors - but running on 8 has proven
reasonable. I have encountered embarrassingly parallel single threaded
applications that can use 1000s of threads/cores via SpMT - but they
are also easy to parallelize.
Overall I see SpMT mainly as a transition / coexistence mechanism: it
allows us to get benefits from MultiCore and ManyCore for some
existing applications, while waiting for your parallel computing
revolution to happen. And, as parallel stuff becomes more common,
SpMT and parallel stuff coexist. If you have enough parallel
applications to fill up all 1000 (or 32) cores, great; if not, and you
have a single threaded application that you want to speed up, SpMT
delivers a modest but significant performance boost.
How much performance? Overall, it seems to be the usual square-root
law: performance grows as the square root of the number of
transistors. For SpMT, performance grows as the square root of the
number of cores. It's not great, but it's better than nothing.
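To make the square-root law concrete, here is the arithmetic with illustrative numbers only (the law itself is an empirical rule of thumb, not a measured result for any particular chip):

```python
import math

# If SpMT performance grows as the square root of the core count,
# the speedups are modest but monotonic.
for cores in (2, 4, 8, 16, 1000):
    print(f"{cores:5d} cores -> ~{math.sqrt(cores):.1f}x single-thread speedup")

# And the square root of an exponential is still an exponential:
# sqrt(2**n) == 2**(n/2), so Moore's Law scaling still compounds.
assert math.sqrt(2**10) == 2**5
```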
What's the power cost? The substrate in which you execute SpMT is
pretty much the same as any CPU core. SpMT likes multithreading, as
mentioned above. And it needs this comparatively small wart on the
side, the "log". There doesn't need to be any snooping hardware: you
can verify speculation and commit results by executing (and verifying)
out of the log at maximum ILP. So, SpMT has a slight tendency to
prefer higher ILP machines, like, maybe, 4 integer ALUs with 2 load
ports. But nothing much beyond 4 makes sense, maybe 8 at a stretch.
Heck: on modern machines we are lucky to get 1 IPC: with SpMT we can
regularly get max ILP, say, 4, during replay.
The SpMT log is basically a device that records program execution. It
doesn't have to record everything, just enough to speed up execution
during replay/verify/reexecution. And at least some parallel
programming people have been asking for stuff like this anyway, to make
debugging (deterministic record/replay) easier.
Sure, you can add snooping hardware to speed up detection of
misspeculations. But snoopers are expensive and cost power. So it's
a tradeoff.
Performance wise, SpMT seems to provide the same benefit that pretty
much every CPU microarchitecture technique does: performance varying
as the square root of the number of transistors. Not so nice as
linear, but if you still believe in Moore's Law scaling, well, the
square root of an exponential is still an exponential. And it appears
that the square root law also applies to parallel programming.
I don't know that I really care about frequency - I'm quite happy to
ride the number of transistors curve. I.e. I am quite happy to use
lots of slow transistors, and just use the number of transistors,
arranged suitably, to speed things up.
But, your challenge was to run at a reasonable IPC at 10GHz, so I'll
just briefly say:
Some recent CPU microarchitectures tried to get frequency at any
cost. This approach was pushed by people who believe YOUR conventional
wisdom, that there is nothing you can do in CPU microarchitecture to
improve ILP. I never agreed with this attitude. I have always tried
to work on speeding up single threaded programs via microarchitecture,
not just by frequency. I have had to adjust my microarchitectures as
the frequency target changes, but my goal has always been the same:
speed up single threaded applications using parallelism.
(I have no problem with explicit parallelism. I like working on
it. But I also do not want to reject implicit parallelism, when it is
there to be found.)
(By the way, "speeding up single threaded applications using
parallelism" is not necessarily the same as increasing ILP.
E.g. if frequency is increasing and the number of gates per clock
cycle is decreasing, then advanced microarchitectures may be necessary
to stay even - to stay at 1 IPC. Similarly if memory delays are
increasing.)
But there were some good ideas in these high frequency designs, as
well as bad. The good ideas involved taking unnecessary stuff out of
the critical loops of the CPU - e.g. did you know that all integer
bypass paths on P6 had an 8-bit shifter to cope with the AH
registers? It's almost never used...
Increasing frequency by getting rid of unnecessary stuff is good. (Or may be
good; there are other reasons, such as tolerating on-die variation, to
go for slower clock frequencies.)
It is unfortunate that some of these good ideas were lost along with
the bad ideas. For example, "replay" pipelines now have a really bad
reputation. But there are ways to implement replay that do not have
the problems that some recent, pre-right-hand-turn, machines do or did.
The high frequency designs that were recently turned away from had the
same problem:
they were not really doing anything new, CPU microarchitecture wise,
except trying to enable 2-4X higher frequencies. They were still
basically first generation out-of-order machines at heart, and in many
ways were inferior OOO machines to their predecessors (and
successors), the P6 family and the AMD K7/K8.
There's maybe 2X frequency to be had by cleaning unnecessary stuff out
of the pipeline, keeping transistors the same speed. But creating a
microarchitecture just so that you can push circuit speed using
expensive power hungry techniques is borderline.
But I don't think that it is necessary to create a new
microarchitecture from scratch to clean up these stupidities. They can
be cleaned up incrementally, from version to version of the same CPU
core.
Bottom line: yes, there's frequency to be gained. But not a
revolution. A cleanup.
---++ If this is such a good idea...
Then why isn't anybody doing it?
Well, I tried, and still want to.
Having witnessed project cancellations (not just my project, other
projects) at both Intel and AMD, I take the moral: nobody is going to
pay you to do a new out-of-order CPU core. It's too risky, and why
would a new OOO core be different or better than the old, polished-up
ones?
You have to do something different to justify designing a new CPU. Or,
better yet, take existing CPU cores, and do something different with
them.
I think to most people SpMT looks like both too much, and too little.
It doesn't really involve designing a CPU core from scratch - it
mainly takes existing CPU cores, and makes a few tweaks to support
SpMT. So you cannot justify a 1000 engineer team to work on it. In
that way, it is too little. And SpMT is too much, because it needs
multiple cores to be interesting. SpMT on a dual core is not very
interesting; on 4 cores, it starts getting interesting.
Anyway, enough on SpMT. I just wanted to tell you that there is at
least somebody in industry who hasn't given up on improving single
thread performance. But the things SpMT needs are almost exactly the
same things as ManyCore needs; it's a balanced wager. If you have
parallel applications, you run explicitly parallel.
If you have single threaded applications, you can speed them up too.
---+ Other stuff
---++ Number of Cores
I couldn't agree with you more about jumping to many cores. My dictum
is: we have both too much parallelism and too little parallelism. We
have too little parallelism to take advantage of a 16 CPU MultiCore.
But there are a whole slew of apps that can be written as 100s of
threads, and run on ManyCore.
In fact, I think that it is often easier for the programmer to write
the application as 1000s of threads. But the threads are irregular,
unbalanced, etc. One of the most important things for parallel
programming systems will be to coalesce the 1000s of programmer
specified threads into a workload that can run efficiently on the
Intel/AMD provided 8-16 CPU MultiCore chips.
---++ Hardware User Level Threads
So, let me pitch an evolutionary step: basically Burton Smith style
multithreading. Instruction set extensions that allow the programmer
to specify
thousands of threads. And a hardware scheduler which timeslices them
onto a MultiCore - or a ManyCore with fewer cores than the user has
specified. Switch on event, switch on cache miss, etc.
A consideration which is irrelevant to academics, but matters to
industry: use this hardware scheduler behind the back of the OS.
Microsoft isn't going to ship a 100 CPU scheduler for consumer devices
next year. But if the consumer OS thinks there are only 4 or 8 CPUs,
we can let the hardware scheduler (a) use as many CPU cores as are
actually available, and (b) timeslice the still larger number of user
threads onto the hardware cores.
This way we can get programmers starting to use 100s and 1000s of
threads before the actual hardware is available. Once the highly
threaded software is available, then it is a question of "How many
hardware threads / cores run this highly threaded application best?"
You said it: Google is in the right position to take advantage of
this.
I see the computer industry as a 3 way competition:
* the PC guys: Intel/AMD/Microsoft
* versus the GPU guys: Nvidia (and maybe AMD/ATI)
* versus Google type centralized services
2 of these corners may gang up on the third. Or maybe everyone will
fight everyone else. With cell phones and Cisco sitting out there
somewhere, waiting to pounce.