- Author
- Siddhartha Chatterjee
- Title
- Compiling Data-Parallel Programs For Efficient Execution on
Shared-Memory Multiprocessors
- Date
- October 1991
- Abstract
- Data parallelism is well-suited from algorithmic, architectural,
and linguistic considerations to serve as a basis for portable
parallel programming.
However, the fine-grained parallelism characteristic of data-
parallel programs makes the efficient implementation of such
languages on MIMD machines a challenging task due to the high
overheads these machines incur at small grain sizes.
We claim that compile-time analysis can be used to reduce these
overheads, thereby allowing data-parallel code to run
efficiently on MIMD machines.
This dissertation reports on the design, implementation, and
evaluation of an optimizing compiler for an applicative nested
data-parallel language called VCODE.
The target machine is the Encore Multimax, a coherent-cache
shared-memory multiprocessor.
The source language allows nested aggregate reductions and
permutations.
Such features greatly expand the range of applications that can
be cast into a data-parallel model.
We present a small set of powerful compile-time techniques that
reduce the overheads on MIMD machines in several ways: by
increasing the grain size of the output program, by reducing
synchronization and storage requirements, and by improving
locality of reference.
The two key ideas behind these optimizations are the symbolic
analysis of loop structures, and the hierarchical clustering of
the program graph, first by loop structure, and then by loop
traversal patterns.
This localizes synchronization and work distribution actions to
well-defined points in the output code.
Loop traversal patterns are then used to identify parallel loops
and to eliminate unnecessary intermediate storage.
The most significant aspect of the analysis techniques is that
they are symbolic in nature and work in the presence of control
constructs such as conditionals and recursion.
A compiler has been implemented based on these ideas and has
been used to compile a large number of benchmarks.
The benchmark suite covers a variety of problem domains,
including dense and sparse matrix operations, tree and graph
algorithms, dynamic and data-dependent algorithms, and low- and
medium-level image processing applications.
The performance results substantiate the claim that data-
parallel code can be made to run efficiently on MIMD machines,
and demonstrate that compiler optimizations are essential for
good performance.
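
The overhead-reduction techniques the abstract names (larger grain
size, fewer synchronization points, less intermediate storage) are
easiest to see on a toy example. The C sketch below is an editor's
illustration, not code from the dissertation: it fuses two
fine-grained data-parallel primitives, an elementwise multiply
followed by a sum reduction, into a single loop, which removes the
intermediate vector and enlarges the work done per loop entry.

  /* Hypothetical illustration of loop fusion as a grain-size and
   * storage optimization. Unfused, each primitive is its own loop
   * with an intermediate vector and a synchronization point between
   * the loops:
   *
   *   for (i = 0; i < n; i++) t[i] = a[i] * b[i];  // elementwise
   *   for (i = 0; i < n; i++) s += t[i];           // reduction
   *
   * After fusion, t[] disappears and one loop does all the work. */
  #include <stddef.h>

  double fused_dot(const double *a, const double *b, size_t n)
  {
      double s = 0.0;
      for (size_t i = 0; i < n; i++)
          s += a[i] * b[i];   /* multiply and reduce in one pass */
      return s;
  }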
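
In the same spirit, the abstract's point about localizing
synchronization and work distribution to well-defined points can be
sketched as a fork at loop entry and a join-and-combine at loop exit,
rather than per-element coordination. This too is only an assumed
illustration: POSIX threads stand in here for whatever runtime the
dissertation's Encore Multimax code actually used.

  /* Assumed sketch: the fused parallel loop is split into chunks;
   * work distribution happens once at the fork, synchronization once
   * at the join, where per-thread partial sums are combined. */
  #include <pthread.h>
  #include <stddef.h>

  #define NTHREADS 4

  struct chunk { const double *a, *b; size_t lo, hi; double partial; };

  static void *worker(void *arg)
  {
      struct chunk *c = arg;
      double s = 0.0;
      for (size_t i = c->lo; i < c->hi; i++)   /* coarse-grained chunk */
          s += c->a[i] * c->b[i];
      c->partial = s;
      return NULL;
  }

  double parallel_dot(const double *a, const double *b, size_t n)
  {
      pthread_t tid[NTHREADS];
      struct chunk ch[NTHREADS];
      double s = 0.0;

      for (int t = 0; t < NTHREADS; t++) {     /* work distribution: one point */
          ch[t] = (struct chunk){ a, b, n * t / NTHREADS,
                                  n * (t + 1) / NTHREADS, 0.0 };
          pthread_create(&tid[t], NULL, worker, &ch[t]);
      }
      for (int t = 0; t < NTHREADS; t++) {     /* synchronization: one point */
          pthread_join(tid[t], NULL);
          s += ch[t].partial;
      }
      return s;
  }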
- Category
- CMUTR
Category: CMUTR
Institution: Department of Computer Science, Carnegie
Mellon University
Number: CMU-CS-91-189
Bibtype: TechReport
Month: oct
Author: Siddhartha Chatterjee
Title: Compiling Data-Parallel Programs For Efficient Execution on
Shared-Memory Multiprocessors
Year: 1991
Address: Pittsburgh, PA
Super: @CMUTR