# Sparse Linear Algebra

### From View

## Contents |

### Description

Sparse matrix algorithms are used when input matrices have such a large number of zero entries that it becomes advantageous, for storage or efficiency reasons, to “squeeze” them out of the matrix representation. Compressed data structures, keeping only the non-zero entries and their indices, are the norm here. As an example, consider sparse matrix-vector multiplication. In the standard compressed row format, the code is

do i=1,n do j=row_start(i),row_start(i+1)-1 y(i) = y(i) + val(j)*x(col_index(j)) enddo enddo

### Uniprocessor Mapping

The codes are quite complex with graph algorithm components that attempt to exploit the structure of the non-zeros. In addition there are a high number of integer operations and indexed accesses. Modern algorithms on cache-based machines also attempt to take advantage of dense blocks so that the high-performing Level 3 operations can be used.

### Parallel Mapping

The dependence structure of many algorithms is extremely intricate because of the desire to only compute on non-zero entries.

### Elaboration

Variations:

o Row Compressed Format (CSR): Compressed Sparse Row

o Block-Compressed Sparse Row (BCSR)

Examples:

o SpMV

o NAS-CG: (http://www.nas.nasa.gov/Software/NPB/)

o OSKI (bebop.cs.berkeley.edu/oski/)

o SuperLU (crd.lbl.gov/~xiaoye/SuperLU/) (http://crd.lbl.gov/~oliker/papers/siamreview02.pdf)

AutoTuning:

o OSKI (bebop.cs.berkeley.edu/oski/)

Specialized Languages:

o Possibly MatLab? (not obvious)

Specialized Hardware:

o Scatter-gather vector architectures are a good example of hardware that works well for sparse linear algebra.

How Current Computer Architectures Fail:

o Block Sizes: The BeBoP group does automatic searches for the optimal blocking arrangement for BCSR matrix encoding. (as shown by Jim Demmel’s diagrams in Dave’s slides).

o Stream Prefetch and Stanzas: For the CSR (row-compressed) sparse matrix encoding, it typically takes a comparatively long stream of consecutive stride-one memory accesses before the stream prefetch fully engages (as measured by the STRIAD benchmark in http://crd.lbl.gov/~oliker/papers/msp_2005.pdf). The result is poor efficiency for the CSR encoding unless explicit prefetches are inserted.