Purpose and Target Market
Double-precision floating-point accelerator for high-performance technical computing.
The CSX architecture is a family of processors based on ClearSpeed’s multi-threaded array processor (MTAP) core. The architecture was developed for high-performance, high-rate processing. CSX processors can be used as application accelerators alongside general-purpose processors such as those from Intel or AMD.
Execution and Control
The MTAP consists of execution units and a control unit. One part of the processor forms the mono execution unit, dedicated to processing scalar (or mono) data. Another part forms the poly execution unit, which processes parallel (or poly) data, and may consist of tens, hundreds or even thousands of identical processing element (PE) cores. This array of PE cores operates in a synchronous, Single Instruction Multiple Data (SIMD) manner, where every PE core executes the same instruction on its piece of data.
The control unit fetches instructions from a single instruction stream, decodes them, and dispatches them to the execution units or I/O controllers. Instructions for the mono and poly execution units are handled similarly, except for conditional execution. The mono unit uses conditional jumps to branch around code, like a standard RISC architecture. Because there is only one instruction stream, such a branch affects both mono and poly operations. The poly unit instead uses an enable register to control execution on each PE. If one or more bits of a PE's enable register are zero, that PE core is disabled and ignores most of the instructions it receives. The enable register is a stack: a new bit, holding the result of a test, can be pushed onto the top of the stack, which allows nested predicated execution. The bit can later be popped from the top of the stack to remove the effect of that condition. This makes handling nested conditions and loops efficient.
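The enable-stack mechanism described above can be emulated on a conventional CPU. The following is a minimal sketch in plain C, not ClearSpeed code: the names enable_stack, enable_push, enable_pop, pe_enabled and run_nested_example, the 8-PE array and the 8-level stack depth are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PES   8   /* illustrative array size, not a real CSX configuration */
#define MAX_DEPTH 8   /* assumed nesting limit for this sketch */

/* Hypothetical model of one PE's enable stack: one bit per nesting level. */
typedef struct {
    uint8_t bits;   /* bit i holds the test result pushed at depth i */
    int     depth;  /* number of results currently on the stack */
} enable_stack;

/* Push the result of a test onto the PE's enable stack. */
static void enable_push(enable_stack *s, bool cond)
{
    assert(s->depth < MAX_DEPTH);
    if (cond)
        s->bits |= (uint8_t)(1u << s->depth);
    s->depth++;
}

/* Pop the most recent test result, removing the effect of that condition. */
static void enable_pop(enable_stack *s)
{
    assert(s->depth > 0);
    s->depth--;
    s->bits &= (uint8_t)~(1u << s->depth);
}

/* A PE is enabled only if every bit pushed so far is one. */
static bool pe_enabled(const enable_stack *s)
{
    uint8_t mask = (uint8_t)((1u << s->depth) - 1u);
    return (s->bits & mask) == mask;
}

/* Emulates, in lock-step over all PEs:
 *     if (pe % 2 == 0) { if (pe < 4) x = 1; }
 */
static void run_nested_example(int x[NUM_PES])
{
    enable_stack es[NUM_PES] = {{0, 0}};
    int pe;

    for (pe = 0; pe < NUM_PES; pe++) enable_push(&es[pe], pe % 2 == 0);
    for (pe = 0; pe < NUM_PES; pe++) enable_push(&es[pe], pe < 4);

    /* Every PE receives the instruction; disabled PEs ignore it. */
    for (pe = 0; pe < NUM_PES; pe++)
        if (pe_enabled(&es[pe]))
            x[pe] = 1;

    for (pe = 0; pe < NUM_PES; pe++) enable_pop(&es[pe]);  /* leave inner if */
    for (pe = 0; pe < NUM_PES; pe++) enable_pop(&es[pe]);  /* leave outer if */
}
```

In this model only PEs 0 and 2 pass both tests, so only they perform the assignment. After the two pops, every PE is enabled again.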
Memory Structure and Hierarchy
To provide fast access to the data being processed, each PE core has its own local memory and register file. Each PE core can directly access only its own storage. (Instructions for the poly execution unit that have a mono register operand access the mono register file indirectly, as the mono value is broadcast to each PE.) Data is transferred between PE (poly) memory and the poly register file via load/store instructions. The mono unit has direct access to main (mono) memory. It also uses load/store instructions to transfer data between mono memory and the mono register file. Programmed I/O (PIO) extends the load/store model: it is used to transfer data between mono memory and poly memory.
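The two data paths from the mono side to the PEs (broadcast of a mono value, and PIO transfers from mono memory to poly memory) can be pictured with a small emulation in plain C. This is only a sketch under stated assumptions: the pe_state type, the function names, the 4-PE array, the 64-byte local memory and the "one consecutive chunk per PE" layout are hypothetical, and the real PIO mechanism is not described in this text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_PES        4    /* illustrative array size */
#define POLY_MEM_BYTES 64   /* illustrative local memory size per PE */

/* Hypothetical model of one PE: private local memory plus one register. */
typedef struct {
    unsigned char mem[POLY_MEM_BYTES];
    double        reg;
} pe_state;

/* A mono operand of a poly instruction: the same value reaches every PE. */
static void broadcast_mono(pe_state pes[NUM_PES], double mono_value)
{
    for (int pe = 0; pe < NUM_PES; pe++)
        pes[pe].reg = mono_value;
}

/* PIO-style transfer (assumed layout): PE n receives the n-th chunk of
 * mono memory, stored at the same local address poly_addr on every PE. */
static void pio_mono_to_poly(pe_state pes[NUM_PES],
                             const unsigned char *mono_mem,
                             size_t poly_addr, size_t bytes_per_pe)
{
    assert(poly_addr + bytes_per_pe <= POLY_MEM_BYTES);
    for (int pe = 0; pe < NUM_PES; pe++)
        memcpy(&pes[pe].mem[poly_addr],
               mono_mem + (size_t)pe * bytes_per_pe,
               bytes_per_pe);
}
```

The point of the sketch is the asymmetry: a broadcast delivers one value to all PEs, while a PIO transfer gives each PE a different piece of mono memory at a common local address.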
The CSX600 is the first product in the CSX family. The processor is optimised for intensive double-precision floating-point computations, sustaining 33 GFLOPS while dissipating an average of 10 watts. The poly execution unit is a linear array of 96 PE cores, each with 6 KB of SRAM and a superscalar 64-bit FPU. The PE cores can communicate with each other via the swazzle path, which connects each PE with its left and right neighbours.
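Neighbour communication over a linear array can be illustrated with a simple shift. The sketch below emulates it in plain C for a 96-element array. The function names are hypothetical, and the edge behaviour (the end PEs receive a caller-supplied fill value) is an assumption, since the text does not say what the outermost PEs receive.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PES 96   /* PE count of the CSX600 poly execution unit */

/* Every PE takes the value held by its left neighbour.
 * PE 0 has no left neighbour and receives edge_fill (assumed behaviour). */
static void swazzle_from_left(double v[NUM_PES], double edge_fill)
{
    assert(v != NULL);
    for (int pe = NUM_PES - 1; pe > 0; pe--)
        v[pe] = v[pe - 1];
    v[0] = edge_fill;
}

/* Every PE takes the value held by its right neighbour.
 * The last PE has no right neighbour and receives edge_fill. */
static void swazzle_from_right(double v[NUM_PES], double edge_fill)
{
    assert(v != NULL);
    for (int pe = 0; pe < NUM_PES - 1; pe++)
        v[pe] = v[pe + 1];
    v[NUM_PES - 1] = edge_fill;
}
```

On the real hardware all PEs would exchange values in the same step. The loops here only serialise that exchange, iterating in the direction that avoids overwriting a value before it has been read.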
Process Technology
IBM 130 nm FSG process, 8-layer metal (copper)
Die Size
15 mm x 15 mm
Number of Transistors
128 million
Performance
Sustained 33 GFLOPS (DGEMM), with an average power dissipation of 10 watts.
Cn is a data-parallel extension to the C language for programming the CSX architecture. In Cn, parallelism is mainly expressed at the type level rather than at the code level.
Essentially, Cn introduces a new multiplicity type qualifier, poly, which specifies that each PE has its own copy of a value. For example, the definition poly int X; means that, on the CSX600 with 96 PEs, there exist 96 copies of the integer variable X, each at the same address within its PE’s local storage. The multiplicity also shows up in conditional statements. For example, the following code sets X to zero on every even-numbered PE (the runtime function get_penum() returns the ordinal number of a PE):
if(get_penum()%2 == 0) X = 0;
On every odd PE, the assignment is not executed (this is equivalent to issuing a NOP instruction, as the SIMD array operates in lock-step).
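For readers without access to a Cn compiler, the semantics of the poly conditional above can be emulated in standard C by making the per-PE copies explicit. This is a sketch only: the poly_int struct and the function zero_on_even_pes are hypothetical stand-ins, and the loop serialises what the SIMD array does in lock-step.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PES 96   /* PE count of the CSX600 */

/* Hypothetical emulation of "poly int X": one copy of the value per PE. */
typedef struct {
    int v[NUM_PES];
} poly_int;

/* Emulates: if (get_penum() % 2 == 0) X = 0;
 * The loop index plays the role of get_penum(). */
static void zero_on_even_pes(poly_int *x)
{
    assert(x != NULL);
    for (int penum = 0; penum < NUM_PES; penum++) {
        if (penum % 2 == 0)
            x->v[penum] = 0;   /* enabled PE: performs the assignment */
        /* odd PE: disabled for this statement, its copy is unchanged */
    }
}
```

In Cn the programmer writes neither the array nor the loop. The poly qualifier on the type is enough for the compiler to apply the statement across all PEs.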
See International Workshop on OpenMP 2007.
Software Development Environment
Cn compiler, assembler, libraries, ddd/gdb-based debugger, visual profiler, newlib-based C runtime library, etc.
The CSXL acceleration library intercepts and accelerates calls to functions in the Basic Linear Algebra Subprograms (BLAS) library. These include Level 3 BLAS DGEMM calls and LAPACK DGETRF calls.
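To make concrete what a DGEMM call computes, here is an unoptimised reference version in plain C. It is not the CSXL or BLAS interface: the function name dgemm_reference, its argument order, the row-major layout and the absence of transpose options are simplifying assumptions for illustration. It evaluates C = alpha * A * B + beta * C for an m-by-k matrix A, a k-by-n matrix B and an m-by-n matrix C.

```c
#include <assert.h>
#include <stddef.h>

/* Reference (unoptimised) double-precision matrix multiply-accumulate:
 *     C = alpha * A * B + beta * C
 * All matrices are dense and stored row-major (an assumption of this sketch). */
static void dgemm_reference(size_t m, size_t n, size_t k,
                            double alpha, const double *a, const double *b,
                            double beta, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t p = 0; p < k; p++)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
    }
}
```

The triple loop performs 2*m*n*k floating-point operations on independent output elements, which is why DGEMM is a natural benchmark and offload target for a double-precision accelerator.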
The CSFFT library provides a set of fast Fourier transform (FFT), inverse FFT, and convolution functions.