Sony/Toshiba/IBM Cell Processor
Purpose and Target Market
General Purpose? Special Purpose? For gaming or telecom?
The Cell processor is designed to be the heart of Sony's PlayStation 3 game console, where it will perform the physics and serve a graphics list to the RSX GPU. If PS2 sales are any indication, millions of machines will be sold in the first year.
In addition, Los Alamos National Lab has announced it will acquire a 16,000-processor Cell and Opteron system. Whether this system's benefit comes from heterogeneity or from a common form factor, power supply, and management functions remains to be seen.
Basic Processing Element(s)
Cell is composed of nine processing elements: a standard PowerPC core, and eight SIMD cores. In addition, there is a dual XDR memory controller, and two I/O controllers. The processing cores, as of rev3, run at 3.2GHz.
The PPE (Power Processing Element) is a 64b, dual issue, dual thread PowerPC core. It provides a full double precision FMA datapath (6.4GFlop/s) and a full AltiVec datapath, which in single precision provides 25.6GFlop/s. Unlike many SMT cores, the two threads alternate issue cycles.
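The two PPE peak figures follow from clock rate times SIMD lanes times two flops per fused multiply-add. A minimal sketch of that arithmetic, where the helper name `peak_gflops` and the lane counts (one double precision lane, four 32b lanes per 128b vector) are illustrative assumptions rather than anything stated by IBM:

```python
def peak_gflops(clock_ghz, lanes, flops_per_lane=2):
    """Peak GFlop/s of a datapath that starts one FMA per lane every cycle.

    A fused multiply-add counts as two flops, hence flops_per_lane=2.
    """
    return clock_ghz * lanes * flops_per_lane


CLOCK_GHZ = 3.2  # clock rate quoted in the text

ppe_double = peak_gflops(CLOCK_GHZ, lanes=1)  # one double precision FMA per cycle
ppe_single = peak_gflops(CLOCK_GHZ, lanes=4)  # four 32b lanes in a 128b vector

print(f"PPE double precision: {ppe_double:.1f} GFlop/s")
print(f"PPE single precision: {ppe_single:.1f} GFlop/s")
```

These are upper bounds that assume an FMA issues on every cycle. Sustained rates depend on the instruction mix.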
Each SPE (Synergistic Processing Element) includes one MFC (Memory Flow Controller) and one SPU (Synergistic Processing Unit). Each SPU includes a 256KB local store (a memory disjoint from the DRAM address space), two in-order SIMD datapaths, and a 128x128b register file.
Each SPU has its own program counter, and can only fetch instructions from its local store. It may issue up to two SIMD instructions per cycle if:
- they are correctly packed into a 128b quad word
- one is an integer, bitwise, or single precision floating point SIMD instruction
- the other is a load, store, permute, branch or channel instruction.
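The three conditions above can be sketched as a simple predicate. The class names, the function `can_dual_issue`, and the `packed` flag are hypothetical. The flag stands in for the quadword packing condition, which depends on instruction addresses that this sketch does not model, and the sketch also ignores any ordering requirement between the two instructions.

```python
# Instruction classes named in the list above; the string labels are illustrative.
ARITHMETIC = {"integer", "bitwise", "single_fp"}
MEMORY_CONTROL = {"load", "store", "permute", "branch", "channel"}


def can_dual_issue(first, second, packed=True):
    """Return True if the pair meets the dual-issue conditions listed above.

    `packed` is a placeholder for "correctly packed into a 128b quadword".
    """
    if not packed:
        return False
    # One instruction must come from each class, in either order.
    arithmetic_then_memory = first in ARITHMETIC and second in MEMORY_CONTROL
    memory_then_arithmetic = first in MEMORY_CONTROL and second in ARITHMETIC
    return arithmetic_then_memory or memory_then_arithmetic


print(can_dual_issue("single_fp", "load"))   # one from each class
print(can_dual_issue("integer", "bitwise"))  # both from the arithmetic class
```

Two instructions from the same class fail the check, which is why inner loops usually benefit from interleaving arithmetic with loads, stores, and permutes.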
The single precision SIMD datapaths are fully pipelined (four 32b FMAs per cycle) and can deliver up to 25.6GFlop/s. However, they have a 6 cycle latency, which necessitates significant unrolling to keep enough independent operations in flight.
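To see why the 6 cycle latency calls for unrolling, consider a toy model of an in-order pipeline that issues at most one SIMD FMA per cycle, where each FMA depends on the previous FMA in the same accumulator chain. The model and the name `fma_cycles` are assumptions for illustration. They are not a description of the actual SPU issue logic.

```python
def fma_cycles(total_fmas, independent_chains, latency=6):
    """Cycles needed to issue `total_fmas` FMAs, interleaved round-robin
    across `independent_chains` accumulators, one issue per cycle at most."""
    ready = [0] * independent_chains  # cycle when each chain's last result is available
    cycle = 0
    for i in range(total_fmas):
        chain = i % independent_chains
        issue = max(cycle, ready[chain])  # in-order: wait for the dependency
        ready[chain] = issue + latency
        cycle = issue + 1
    return cycle


PEAK_GFLOPS = 25.6  # single precision peak quoted in the text
for chains in (1, 2, 3, 6):
    cycles = fma_cycles(600, chains)
    print(f"{chains} chain(s): {cycles} cycles, "
          f"{PEAK_GFLOPS * 600 / cycles:.1f} GFlop/s in this model")
```

In this model a single dependent chain reaches roughly one sixth of peak, while six independent chains (an unroll factor equal to the latency) keep the pipeline full.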
All loads and stores operate at quadword (128b) granularity and may only access the local store, using a 32b local address space. All scalar loads and stores must be implemented in software via permute instructions.
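A software scalar access can be sketched by modelling the local store as a byte array. The sketch assumes big-endian data, naturally aligned 32b words, and a byte rotation as the permute-style step that moves the word to the front of the quadword. All function names are hypothetical, and a real SPU instruction sequence may differ.

```python
LOCAL_STORE_BYTES = 256 * 1024
QUADWORD = 16  # bytes in a 128b quadword


def load_quadword(local_store, address):
    """The only kind of load available: a full, aligned 16-byte quadword."""
    base = address & ~(QUADWORD - 1)
    return bytes(local_store[base:base + QUADWORD])


def rotate_left_bytes(quad, count):
    """Permute-style step: rotate the quadword left by `count` bytes."""
    count %= QUADWORD
    return quad[count:] + quad[:count]


def load_scalar_u32(local_store, address):
    """Scalar load: quadword load, then rotate the word into bytes 0..3."""
    quad = load_quadword(local_store, address)
    rotated = rotate_left_bytes(quad, address % QUADWORD)
    return int.from_bytes(rotated[:4], "big")


def store_scalar_u32(local_store, address, value):
    """Scalar store: read the quadword, splice in the word, write it back."""
    base = address & ~(QUADWORD - 1)
    offset = address % QUADWORD
    quad = bytearray(load_quadword(local_store, address))
    quad[offset:offset + 4] = value.to_bytes(4, "big")
    local_store[base:base + QUADWORD] = quad


local_store = bytearray(LOCAL_STORE_BYTES)
store_scalar_u32(local_store, 0x1008, 0xDEADBEEF)
print(hex(load_scalar_u32(local_store, 0x1008)))
```

The store is a read-modify-write of the whole quadword, which is the main reason scalar-heavy code is comparatively expensive on an SPU.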
The double precision pipeline, at 13 cycles, is significantly longer than the single precision datapath. IBM appears to have chosen a rather straightforward approach to fitting a 13 cycle pipeline into a 7 cycle forwarding network: it stalls subsequent issues for 6 cycles. Thus only one SIMD double precision floating point instruction can be issued every 7 cycles, for a peak of 1.83GFlop/s per SPE.
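The 1.83GFlop/s figure can be reproduced from the numbers in the text, assuming each SIMD double precision instruction is an FMA on the two 64b values that fit in a 128b register. The constant names are illustrative.

```python
CLOCK_GHZ = 3.2
DOUBLES_PER_QUADWORD = 128 // 64  # two 64b values per 128b register
FLOPS_PER_FMA = 2                 # multiply plus add
ISSUE_INTERVAL_CYCLES = 1 + 6     # one issue cycle followed by a 6 cycle stall

flops_per_instruction = DOUBLES_PER_QUADWORD * FLOPS_PER_FMA
spe_double_gflops = CLOCK_GHZ * flops_per_instruction / ISSUE_INTERVAL_CYCLES

print(f"SPE double precision peak: {spe_double_gflops:.2f} GFlop/s")
print(f"Single/double peak ratio:  {25.6 / spe_double_gflops:.0f}x")
```

The ratio of 14x comes from halving the lane count (four 32b lanes versus two 64b lanes) and dividing the issue rate by seven.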
As previously discussed, all SPU loads, stores, and instruction fetches may only access the local store. It is the MFC's responsibility to move data into and out of the corresponding SPU's local store. Thus the primary advantage of the PPE over the SPEs is not performance but productivity, since it offers a conventional programming model.
Interconnect and Topologies
All elements on the Cell chip are connected via the EIB (Element Interconnect Bus), which is composed of four 128b rings running at 1.6GHz. Two rings run in one direction and two run in the other. There are restrictions on which ring data may be inserted into, based on the source and destination of the data item. As such, latency and bandwidth depend on the communication pattern.
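A rough feel for the pattern dependence comes from counting hops on a bidirectional ring. The sketch below assumes 12 ring stops (the nine processing elements, the memory controller, and the two I/O controllers) numbered in ring order, and assumes data always takes the shorter direction. The actual physical ordering and the arbitration rules are not given here, so both are hypothetical.

```python
RING_STOPS = 12          # 9 processing elements + memory controller + 2 I/O controllers
RING_WIDTH_BYTES = 16    # 128b per ring
RING_CLOCK_GHZ = 1.6


def ring_hops(src, dst, stops=RING_STOPS):
    """Hops between two stops when data travels in the shorter direction."""
    clockwise = (dst - src) % stops
    return min(clockwise, stops - clockwise)


# Peak transfer rate of a single ring, ignoring arbitration and contention.
per_ring_gbytes = RING_WIDTH_BYTES * RING_CLOCK_GHZ

print(f"Neighbour transfer:   {ring_hops(0, 1)} hop")
print(f"Opposite side of die: {ring_hops(0, 6)} hops")
print(f"Peak per ring:        {per_ring_gbytes:.1f} GB/s")
```

Under these assumptions a transfer between neighbours occupies one ring segment while a transfer across the chip occupies six, so nearest-neighbour patterns leave more of each ring free for concurrent transfers.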
Memory Structure and Hierarchy
Shared memory? Distributed memory? Caches? Scratch Spaces?
Special Purpose Hardware Units
Vector units? Crypto units?
I/O and Peripherals
Memory controller? DMA engines? Ethernet Controller? Hypertransport?
Whose fab? Which year? Layers of metal? Fast or slow process? Multi-Vt process?
The Cell die size is about 220mm^2. Roughly speaking,
- the Power core is about 10%
- the L2 is about 10%
- the SPEs are about 7% each
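The percentages above can be turned into approximate areas. This is only arithmetic on the rough figures in the list. The dictionary keys are illustrative, and the remainder is presumably everything not listed (interconnect, memory and I/O controllers, and so on).

```python
DIE_MM2 = 220
NUM_SPES = 8

# Rough shares of the die, in percent, from the list above.
percent = {
    "Power core": 10,
    "L2": 10,
    "SPE (each)": 7,
}

accounted_percent = (percent["Power core"] + percent["L2"]
                     + NUM_SPES * percent["SPE (each)"])
remaining_percent = 100 - accounted_percent

area_mm2 = {name: DIE_MM2 * share / 100 for name, share in percent.items()}

for name, area in area_mm2.items():
    print(f"{name:12s} ~{area:.1f} mm^2")
print(f"All 8 SPEs   ~{NUM_SPES * area_mm2['SPE (each)']:.1f} mm^2")
print(f"Unlisted     ~{DIE_MM2 * remaining_percent / 100:.1f} mm^2 "
      f"({remaining_percent}% of the die)")
```

By these estimates the eight SPEs together take a little over half the die, and about a quarter is left for everything else.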
IBM has been a bit cagey about releasing power consumption figures. Publicly, they have stated that their dual chip blades consume about 315W, with an additional 15W per InfiniBand interface. Depending on temperature, SPE power is between 3 and 7W, and rough estimates have placed full chip power at about 100W.
Two separate programs are written. First, the SPE program is written; it manages its memory explicitly in software. It may be necessary to use intrinsics in the inner loops to fully exploit the SPEs' computational capabilities. The SPE program is then embedded within a standard PowerPC program. The PowerPC program creates SPE threads, passing each a pointer to the embedded SPE program. All 10 threads (two on the PPE and eight on the SPEs) operate independently and are explicitly synchronized by the programmer.
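The shape of this two-program model can be imitated with ordinary Python threads. This is an analogy only: it does not use IBM's SDK, and the names `spe_program` and `ppe_main` are hypothetical. The worker function plays the role of the embedded SPE program, copies its slice of "main memory" into a private buffer standing in for the local store, and the main function plays the role of the PowerPC program.

```python
import threading

NUM_SPES = 8


def spe_program(spe_id, main_memory, results):
    """Stand-in for the embedded SPE program: copy in, compute, copy out."""
    chunk = len(main_memory) // NUM_SPES
    start = spe_id * chunk
    local_store = main_memory[start:start + chunk]  # explicit copy into a private buffer
    partial = sum(x * x for x in local_store)       # inner loop (intrinsics on real hardware)
    results[spe_id] = partial                       # explicit copy of the result back out


def ppe_main(data):
    """Stand-in for the PowerPC program: create the threads, then synchronize."""
    results = [None] * NUM_SPES
    threads = [
        threading.Thread(target=spe_program, args=(spe_id, data, results))
        for spe_id in range(NUM_SPES)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # synchronization is the programmer's job
    return sum(results)


data = list(range(1024))
total = ppe_main(data)
print(total)
```

Each worker writes only its own slot in `results`, so no locking is needed beyond the final joins. On real hardware the copies would be DMA transfers issued through the MFC.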
Software Development Environment
System design tool stack? Availability of layers in this tool stack?