Many problems can be described in the form of updates on an irregular mesh or grid, with each grid element being updated from neighboring grid elements. When problem sizes are large, and the irregularity of the data causes problematic data distribution and access requirements, how can we efficiently perform these computations on parallel computers?
Many modeling problems involve objects with irregular geometric definitions. The dataset is typically defined as a mesh covering the surfaces or volumes of these objects, and mesh granularity depends on local properties of the problem domain. Entities in the mesh (i.e., points, edges, faces, and/or volumes) must be explicitly represented, usually using multiple tables, one for each entity type, linked by pointers or integer offsets. Computations frequently involve the numerical solution of differential equations. Applications typically require the modeling of quantities such as tension, temperature, or pressure at each point in the grid, evolving over time. Computation proceeds as a sequence of mesh update steps. At each step, values associated with each entity are updated in parallel, based on values retrieved from neighboring entities. The general form of the computation accesses neighboring elements, as in the structured grid, but the neighbors could be either points, or edges, or faces, or volumes. Datasets can be very large, involving millions of grid points, at each of which the application must perform time-stepped updates based on nearest-neighbor values.
These codes have a high degree of parallelism, but data accesses require indirection, which adds memory for the indexing information and results in poor spatial locality, as illustrated in the following example code:
The mesh structures may be static or they may change dynamically due to changes in the physical object being simulated, but even in the dynamic scenario the structure may remain fixed for several steps.
The demand for precision in the quantities being modeled pushes towards finer meshes and smaller time-steps, but the dataset size and computational requiremets can grow rapidly.
The massively data-parallel nature of the updates to the mesh encourages massive parallelization, but the irregular nature of the data complicates data distribution and orchestration of inter-processor communication.
The mesh of data can be represented as a graph whose edges represent the geometric nearest-neighbor relationship between mesh points. Typically, communication patterns in mesh problems follows this nearest-neighbor relationship. That is, the value of a mesh point in the future will depend on its own value and the values of its geometric neighborhood. Partitioning such a nearest-neighbor graph to minimize the number of cross-partition edges (total edge cut) is equivalent to finding a communication-minimizing data distribution. The updates to a processor's sub-mesh will depend on the sub-meshes of processors containing its nearest-neighbors. Since the size of sub-meshes (i.e. the number of grid-points per sub-mesh) determines the amount of work each processor is assigned, and thus the load balance of the computation, it is important for each processor's subdomain to be of approximately equal size. Several software packages exist for solving this graph partitioning problem, possibly the most notable being Metis and Parmetis [.
For many problems, a statically defined mesh is sufficient to achieve the necessary numerical stability. For some highly dynamic computations, however, an Adaptive Mesh Refinement (AMR) scheme is necessary to maintain stability. The additional complexity of these codes is frequently difficult to justify. [cite someone's AMR work] Not only is the development and verification cost very high, but the dynamic nature of an AMR mesh further complicates the data distribution and load balancing problem. Thus a static mesh is extremely desirable, if it is at all practical.
Solution of differential equations over these domains are typically considered Sparse Linear Algebra problems [Cite the sparse linear algebra pattern]
For example, the design of airfoils for airplanes requires modeling the fluid dynamics of a wing in air. In this instance, the surface of the airfoil forms is defined as a mesh of points, whose geometric properties are of import to the problem. In large, flat surfaces of the wing, the mesh can be coarse, as the dynamics on these regions will tend to be low-magnutude. At the wing-tips, however, where the surface has high curvature, a finer-granularity mesh is necessary to effectively capture the system's properties.
The code is often highly vectorizable, but requires scatter-gather memory accesses to retrieve data items. An implementation may flatten the data structure so that the neighbors of a given entity or multiple entities may be loaded with a single gather operation. Memory hierarchies are less effective on these codes. Spatial locality is usually limited to one of the entity table types at a time, as the other tables are accessed indirectly with patterns dependent on the mesh structure. There is potential for some temporal locality, as a given entity is used in the calculations for several different neighborhoods. Graph partitioners such as Metis and Zoltan or other reordering algorithms such as space-filling curves can be used to place neighboring points nearby in memory, thereby increasing locality, and careful encoding of the data structure to reduce metadata overhead can reduce memory traffic. Although the memory access patterns repeat from one mesh update to the next, each step is too long to record as a hardware prefetch pattern, and the indirection in the code makes compile-time analysis difficult. The patterns is difficult to encode more efficiently than the index structures themselves, making fully-automated hardware prefetching difficult to design. The address resolution, is typically not data-dependent for static calculations, so computations required to resolve addresses are not dependent on the computations performed on those addresses during a given iteration. A software prefetch mechanism can be used, in principle to execute the index stream because the order in which the data items are visited is static within a given iteration. However, the prefetch operations or “speculative thread” that walk ahead to resolve addresses must lead the computations by a distance equal to the product of the available memory subsystem bandwidth and latency to main memory for resolving all of the levels of indirection in order to fully saturate available bandwidth the memory interface. This relationship is specified in Little’s Law or in networking circles is referred to as the “bandwidth-delay product.” In a conventional cache-hierarchy such mechanism can still suffer from undesired cache-line evictions due to cache-line aliasing.
The mesh structure can be divided into contiguous sub-domains, where neighboring entities reside together on the same node. Communication and synchronization between nodes is mostly only to neighbors, although the neighborhood is less regular than in structured grid codes. Graph partitioners are used to simultaneously assign equal portions of the mesh to each processor while reducing the number of edges that cross between processors. A common implementation strategy is to allocate a set of ghost regions around each processors portion of the graph, and fill this in with copies of data from neighboring node before performing each computation step. The mesh partitioning is usually performed as a preprocessing step, but adaptive algorithms may require embedded mesh partitioners that operate incrementally in response to load imbalances that are detected at runtime.
Original text by Krste Asanovic. Additions by Kathy Yelick, John Shalf, and Ras Bodik.