Data is arranged in a regular multidimensional grid (most commonly 2D or 3D, sometimes 4D, but rarely higher). Computation proceeds as a sequence of grid update steps. At each step, all points are updated using values from a small neighborhood around each point. The general form of a structured grid computation is as follows.
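As a sketch of this general form (the 5-point averaging stencil and the helper names `update_step` and `run` are illustrative assumptions, not taken from the original):

```python
def update_step(grid):
    """One grid update step: every interior point is recomputed from a
    small neighborhood (here, itself and its 4 immediate neighbors) of
    the previous grid. Boundary points are left unchanged."""
    n = len(grid)
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1])
    return new

def run(grid, steps):
    """Computation proceeds as a sequence of update steps."""
    for _ in range(steps):
        grid = update_step(grid)
    return grid
```

The same skeleton covers 3D and 4D grids; only the loop nest and the neighborhood sum change.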
An example of a structured grid computation is the Jacobi method for solving the Poisson Problem, where u is the updated grid and f is a constant grid:
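One plausible rendering of that update, assuming the standard five-point discretization of the Poisson equation on a grid with spacing h (the function name `jacobi_step` is ours):

```python
def jacobi_step(u, f, h):
    """One Jacobi sweep for the discrete Poisson problem: each interior
    point of the updated grid is the average of its four neighbors in the
    old grid u, corrected by the constant right-hand-side grid f."""
    n = len(u)
    unew = [row[:] for row in u]  # boundary values carried over unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                 + u[i][j - 1] + u[i][j + 1]
                                 - h * h * f[i][j])
    return unew
```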
The updates are logically concurrent but in practice are implemented as a sequential sweep through the computational domain. Updates may be performed in place, overwriting the previous version, or may write into a second copy of the grid. In many cases the updates can be performed in parallel, provided that two versions of the grid are used or that updates alternate (e.g., in a red-black pattern) so that values being updated are not simultaneously being read as neighbors in the same step. These codes have a high degree of parallelism, and their data access patterns are regular and statically determinable.
The code is highly vectorizable, using unit-stride or constant-stride memory accesses. The points can be visited in an order (both within one update step and across multiple sequential update steps) that provides spatial locality to make good use of long cache lines and temporal locality to allow cache reuse. Spatial locality is very high, and either hardware or software prefetching is effective given the predictable addressing pattern. Temporal locality is more limited. Within one update step, the amount of temporal locality depends on the size of the neighborhood, as each data value is accessed once by each neighborhood that contains it. For large meshes, some neighboring values may not be preserved in cache due to cache-line aliasing, and even if they are preserved, each value is used only a constant number of times. (For example, a point in a 3D mesh with 6 immediate neighbors may involve only 7 arithmetic operations per point loaded from main memory.) In some cases there is additional temporal locality across update steps: the value of d at step t+1 may depend only on the values of d at step t in its neighborhood, so a subset of the grid can be advanced multiple time steps using only locally produced data, although this complicates the logic for loop control and address generation. This optimization is somewhat rare in practice, as most algorithms require other computations, such as updates to boundary values or other data structures, between steps.
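The cross-step temporal locality can be sketched as follows (a hypothetical illustration, using a 5-point averaging stencil as the update rule): a tile held in cache with a halo two cells wide can be advanced two steps without touching main memory, because the value at step t+2 depends only on the radius-2 neighborhood at step t.

```python
def advance_two_steps_in_cache(tile):
    """Sketch of temporal blocking: advance a cached tile two update
    steps using only its own data. After two steps, only the interior
    shrunk by the halo width (2 cells) holds valid results."""
    def step(g):
        n, m = len(g), len(g[0])
        out = [row[:] for row in g]
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                out[i][j] = 0.2 * (g[i][j] + g[i - 1][j] + g[i + 1][j]
                                   + g[i][j - 1] + g[i][j + 1])
        return out
    return step(step(tile))
```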
Each processor can be statically assigned a contiguous subgrid and can perform each update step locally, independently of the other nodes. Each node only has to communicate and synchronize with neighboring nodes on the grid, exchanging data from the boundary of its subgrid at the end of each step. Parallel efficiency is determined by the surface-to-volume ratio of the subgrids, and the width of the surface is determined by the width of the neighborhood surrounding each point. The redundant cells replicated at the outer boundaries of each subgrid are termed "ghost zones". Communication latency can be hidden by widening the ghost zones at the outer boundaries and exchanging more data less frequently.
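The boundary exchange can be sketched as follows. Here the two subgrids live in one address space and the "communication" is a pair of copies standing in for the messages a real distributed code (e.g., MPI sends and receives) would use; each subgrid carries one ghost column on the shared edge.

```python
def exchange_ghosts(left, right):
    """Ghost-zone exchange between two horizontally adjacent subgrids.
    Each row of `left` ends in a ghost cell; each row of `right` begins
    with one. The ghost cells receive the neighbor's first interior
    value, so the next update step can proceed locally."""
    for i in range(len(left)):
        left[i][-1] = right[i][1]   # left's ghost <- right's interior edge
        right[i][0] = left[i][-2]   # right's ghost <- left's interior edge
```

With a wider ghost zone (more columns), the same exchange would be done less frequently, at the cost of some redundant computation in the overlap.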
Variant: Adaptive Mesh Refinement
The grid may be refined by overlaying higher-resolution grids across areas of interest, and the refinement process can proceed recursively and adaptively over time. The logical neighborhood of a given point may be on another grid patch at the same level of refinement or at a different level, thereby complicating indexing and reducing spatial locality. A significant amount of time, if not computational work, may be spent on the boundaries of all the patches and the prolongation and restriction operations that interpolate the data from coarse grids to refined grids and back again. On a parallel machine, this requires some form of dynamic load balancing to assign roughly equal amounts of work to each processor while minimizing the communication between them.
Original text by Krste Asanovic. Additions by Kathy Yelick, John Shalf, and Ras Bodik.