Dense Linear Algebra
These are the classic vector and matrix operations, traditionally divided into Level 1 (vector/vector), Level 2 (matrix/vector), and Level 3 (matrix/matrix) operations. Data is typically laid out as a contiguous array and computations on elements, rows, columns, or matrix blocks are the norm. For example, matrix multiplication is commonly expressed as
do i=1,n do j=1,n do k=1,n a(i,j) = a(i,j) + b(i,k)*c(k,j) enddo enddo enddo
These codes are a very natural match for vector machines with data accesses mostly having unit and constant strides. For cache based machines, the development of block algorithms has resulted in performance for Level 3 operations that can approach the peak speed of the machine. Thus, many algorithms have been recast in order to exploit matrix/matrix operations as much as possible.
In addition to memory hierarchy issues, data distribution for load balance becomes a priority here. For this reason 2-d block cyclic distributions have become the preferred ones for many matrix algorithms. Further, the use of computation/communication overlap has permitted these algorithms to scale very well, as exhibited by the tremendous performance of the Top 500 High Performance Linpack benchmark.
Original text by Parry Husbands.