Changeset 451
- Timestamp: 02/15/09 23:38:08
- Files:
- orio/doc/papers/ipdps09/abstract.tex (modified) (1 diff)
- orio/doc/papers/ipdps09/final-version.pdf (modified)
- orio/doc/papers/ipdps09/implementation.tex (modified) (4 diffs)
- orio/doc/papers/ipdps09/integration.tex (modified) (3 diffs)
- orio/doc/papers/ipdps09/motivation.tex (modified) (1 diff)
- orio/doc/papers/ipdps09/notes.txt (modified) (1 diff)
- orio/doc/papers/ipdps09/paper.bib (modified) (10 diffs)
- orio/doc/papers/ipdps09/paper.tex (modified) (1 diff)
- orio/doc/papers/ipdps09/related.tex (modified) (2 diffs)
- orio/doc/papers/ipdps09/results.tex (modified) (3 diffs)
orio/doc/papers/ipdps09/abstract.tex (r427 → r451)

- In many scientific applications, significant time is spent in tuning
+ For many scientific applications, significant time is spent in tuning
  codes for a particular high-performance architecture. Tuning
  approaches range from the relatively nonintrusive (e.g., by using
  compiler options) to extensive code modifications that attempt to
  exploit specific architecture features. Intrusive techniques often
- result in code changes that are not easily reversible, which can
+ result in code changes that are not easily reversible, and can
  negatively impact readability, maintainability, and performance on
  different architectures. We introduce an extensible annotation-based
- empirical tuning system called Orio, which is aimed at improving both
- performance and productivity by enabling software developers to insert
+ empirical tuning system called Orio that is aimed at improving both
+ performance and productivity. It allows software developers to insert
  annotations in the form of structured comments into their source code
- that trigger a number of low-level performance optimizations on a
+ to trigger a number of low-level performance optimizations on a
  specified code fragment. To maximize the performance tuning
  opportunities, the annotation processing infrastructure is designed to

orio/doc/papers/ipdps09/implementation.tex
(r443 → r451)

  as much as possible).
  %
- Finally, Orio is usable in real scientific applications without
+ Finally, Orio is usable with real scientific applications without
  requiring reimplementation. This ensures that the significant investments in
  the development of complex scientific codes is leveraged to the greatest
…
  for each distinct combination of performance parameter values. Then each
  optimized code version is executed and its performance evaluated. After
- iteratively evaluating all code variants, the best-performing code is picked
+ iteratively evaluating all code variants, the best-performing code is selected
  as the final output of Orio. Because the search space of all possible
  optimized code versions can be huge, a brute force search strategy is not
…
  Orio annotations are embedded into the source code as structured C comments
- that always start with \texttt{/*@} and end with \texttt{@*/}. For example,
+ that start with \texttt{/*@} and end with \texttt{@*/}. For example,
  the annotation \texttt{/*@ end @*/} is used to indicate the end of an
  annotated code region. A simple grammar illustrating the basic syntax of Orio
  annotations is depicted in Figure~\ref{fig:ann-lang}. An annotation region
  consists of three main parts: leader annotation, annotation body, and trailer
- annotation. The annotation body can be either empty or contain C code that
+ annotation. The annotation body can either be empty or contain C code that
  possibly includes other nested annotation regions. A leader annotation
  contains the name of the code transformation module used to transform and
…
  Current optimizations supported by Orio span different types of code
- transformations that are not provided by production compilers in some
+ transformation that are not provided by production compilers in some
  cases.
  Available optimizations include simple loop unrolling, memory
  alignment optimization, loop unroll/jamming, loop tiling, loop permutation,

orio/doc/papers/ipdps09/integration.tex
(r443 → r451)

  optimization parameters code. The loop code is embedded in the
  annotation body block, while the performance parameters code is
- written in the module body block. A simple parser implemented inside
+ written in the module body block. A parser implemented inside
  the new module parses the performance parameters code to extract their
  values. The extracted values of the performance parameters include
…
  described in Section~\ref{sec:implm}.

- Pluto also performs syntactic loop unroll/jamming as a post-processing
+ Pluto also performs syntactic loop unroll/jam in a post-processing
  pass. The target loops, however, are limited only to innermost loops with a
  maximum depth of two (i.e., 1-D unrolling and 2-D unroll/jamming). So we
- choose to use the loop unroll/jamming transformation already available in
+ choose to use the loop unroll/jam transformation already available in
  Orio, which is more flexible because it can be applied to loop nests of
  depths larger than two.
…
- At present Orio does not employ any data dependency analysis when performing
+ At present Orio does not employ any data dependence analysis when performing
  syntactic transformations. Therefore, there is no guarantee that the code
  generated after syntactic transformations is correct. Because of this, the

orio/doc/papers/ipdps09/motivation.tex
(r428 → r451)

  loop-level optimizations. Manual tuning is time-consuming and impedes
  readability and performance portability. Tuned libraries often deliver
- great performance without requiring significant programming effort,
+ excellent performance without requiring significant programming effort,
  but can only provide limited functionality. General-purpose source
  transformation tools for performance optimizations are few and have

orio/doc/papers/ipdps09/notes.txt
(r445 → r451)

  @INPROCEEDINGS{Hart0000:Orio,
  AUTHOR="Albert Hartono and Boyana Norris and Ponnuswamy Sadayappan",
- TITLE="Annotation-Based Empirical Performance Tuning Using Orio",
+ TITLE="Annotation-Based Empirical Performance Tuning Using {Orio}",
  BOOKTITLE="23rd IEEE International Parallel \& Distributed Processing Symposium",
  ADDRESS="Rome, Italy",
  KEYWORDS="empirical performance tuning, code generation",
- ABSTRACT="In many scientific applications, significant time is spent in tuning codes
- for a particular high-performance architecture. Tuning approaches range
- from the relatively nonintrusive (e.g., by using compiler options) to
- extensive code modifications that attempt to exploit specific architecture
- features. Intrusive techniques often result in code changes that are not
- easily reversible, which can negatively impact readability,
- maintainability, and performance on different architectures. We introduce
- an extensible annotation-based empirical tuning system called Orio, which
- is aimed at improving both performance and productivity by enabling
- software developers to insert annotations in the form of structured
- comments into their source code that trigger a number of low-level
- performance optimizations on a specified code fragment. To maximize the
- performance tuning opportunities, the annotation processing infrastructure
- is designed to support both architecture-independent and
- architecture-specific code optimizations. Given the annotated code as
- input, Orio generates many tuned versions of the same operation and
- empirically evaluates the alternatives to select the best performing
- version for production use. We have also enabled the use of the Pluto
- automatic parallelization tool in conjunction with Orio to generate
- efficient OpenMP-based parallel code. We describe our experimental results
+ ABSTRACT="For many scientific applications, significant time is spent in tuning
+ codes for a particular high-performance architecture. Tuning
+ approaches range from the relatively nonintrusive (e.g., by using
+ compiler options) to extensive code modifications that attempt to
+ exploit specific architecture features. Intrusive techniques often
+ result in code changes that are not easily reversible, and can
+ negatively impact readability, maintainability, and performance on
+ different architectures. We introduce an extensible annotation-based
+ empirical tuning system called Orio that is aimed at improving both
+ performance and productivity. It allows software developers to insert
+ annotations in the form of structured comments into their source code
+ to trigger a number of low-level performance optimizations on a
+ specified code fragment. To maximize the performance tuning
+ opportunities, the annotation processing infrastructure is designed to
+ support both architecture-independent and architecture-specific code
+ optimizations. Given the annotated code as input, Orio generates many
+ tuned versions of the same operation and empirically evaluates the
+ alternatives to select the best performing version for production
+ use. We have also enabled the use of the Pluto automatic
+ parallelization tool in conjunction with Orio to generate efficient
+ OpenMP-based parallel code. We describe our experimental results
  involving a number of computational kernels, including dense array and
- sparse matrix operations."
- }
+ sparse matrix operations."}

orio/doc/papers/ipdps09/paper.bib
(r449 → r451)

  PAGES = {308-323} }

- @BOOK{laug,
-   AUTHOR = {E. Anderson and Z. Bai and C. Bischof and J. Demmel and
-     J. Dongarra and J. Du Croz and A. Greenbaum and S. Hammarling
-     and A. McKenney and S. Ostrouchov and D. Sorensen},
-   TITLE = {{LAPACK} Users' Guide},
-   EDITION = {Second},
-   PUBLISHER = siam,
-   YEAR = {1995},
-   ADDRESS = {Philadelphia, PA} }
+ @book{laug,
+   author = {Anderson,, E. and Bai,, Z. and Bischof,, C. and Demmel,, J. and Dongarra,, J. and Du Croz,, J. and Greenbaum,, A. and Hammarling,, S. and McKenney,, A. and Ostrouchov,, S. and Sorensen,, D.},
+   title = {LAPACK's user's guide},
+   year = {1992},
+   isbn = {0-89871-294-7},
+   publisher = {Society for Industrial and Applied Mathematics},
+   address = {Philadelphia, PA, USA},
+ }

  @TECHREPORT{lawn111,
…
  author = {Ken Kennedy and others},
  title = {Telescoping Languages Project Description},
- howpublished = {\url{telescoping.rice.edu}},
+ howpublished = {\url{http://telescoping.rice.edu/}},
  year = 2006
  }
…
  title = {Telescoping Languages: A System for Automatic Generation of Domain Languages},
  year = {2005},
- URL = {http://ieeexplore.ieee.org/xpl/abs\_free.jsp?arNumber=1386658},
  abstract = { The software gap - the discrepancy between the need for new software and the aggregate capacity of the workforce to produce it - is a serious problem for scientific software. Although users appreciate the convenience (and, thus, improved productivity) of using relatively high-level scripting languages, the slow execution speeds of these languages remain a problem. Lower level languages, such as C and Fortran, provide better performance for production applications, but at the cost of tedious programming and optimization by experts. If applications written in scripting languages could be routinely compiled into highly optimized machine code, a huge productivity advantage would be possible. It is not enough, however, to simply develop excellent compiler technologies for scripting languages (as a number of projects have succeeded in doing for MATLAB). In practice, scientists typically extend these languages with their own domain-centric components, such as the MATLAB signal processing toolbox. Doing so effectively defines a new domain-specific language. If we are to address efficiency problems for such extended languages, we must develop a framework for automatically generating optimizing compilers for them. To accomplish this goal, we have been pursuing an innovative strategy that we call telescoping languages. Our approach calls for using a library-preprocessing phase to extensively analyze and optimize collections of libraries that define an extended language. Results of this analysis are collected into annotated libraries and used to generate a library-aware optimizer. The generated library-aware optimizer uses the knowledge gathered during preprocessing to carry out fast and effective optimization of high-level scripts. This enables script optimization to benefit from the intense analysis performed during preprocessing without repaying its price. Since library preprocessing is performed only at infrequent "language-generation" times, its cost is amortized over many compilations of individual scripts that use the library. We call this strategy "telescoping languages" because it merges knowledge of a hierarchy of extended languages into a single library-aware optimizer. We present our vision and plans for compiler frameworks based on telescoping languages and report on the preliminary research that has established the effectiveness of this approach. },
  journal = {Proceedings of the IEEE},
…
  pages = {156--165},
  year = {1993},
- url = "citeseer.ist.psu.edu/weise93programmable.html"
  }
…
  pages = "340--347",
  year = "1997",
- url = "citeseer.ist.psu.edu/article/bilmes97optimizing.html"}
+ }

  @misc{ vuduc03memory,
…
- @InProceedings{POET,
+ @inproceedings{POET,
+   author = {Q. Yi and K. Seymour and H. You and R. Vuduc and D. Quinlan},
+   booktitle = {Workshop on Performance Optimization of High-Level Languages and Libraries (POHLL)},
+   title = {{POET}: Parameterized Optimizations for Empirical Tuning},
+   publisher = {IEEE Computer Society},
+   pages = {1--8},
+   location = {Long Beach, CA},
+   month = {March},
+   year = {2007}
+ }
+
+ @InProceedings{POET1,
  author = {Qing Yi and Keith Seymour and Haihang You and Richard Vuduc and Dan Quinlan},
  title = {{POET}: Parameterized Optimizations for Empirical Tuning},
  booktitle = {Proceedings of the Parallel and Distributed Processing Symposium, 2007},
  year = {2007},
- publisher = {IEEE},
+ publisher = {IEEE Computer Society},
  pages = {1--8},
- DOI = {10.1109/IPDPS.2007.370637},
  location = {Long Beach, CA},
  month = {Mar}
…
- @inproceedings{Siek:1998ys,
-   Author = {Jeremy G. Siek and Andrew Lumsdaine},
-   Booktitle = {Parallel Object Oriented Scientific Computing},
-   Date-Added = {2006-09-26 11:13:25 -0600},
-   Date-Modified = {2007-05-18 13:42:22 -0600},
-   Organization = {ECOOP},
-   Title = {A Rational Approach to Portable High Performance: The {Basic Linear Algebra Instruction Set (BLAS)} and the {Fixed Algorithm Size Template (FAST)} Library},
-   Year = 1998}
+ @InProceedings{Siek:1998ys,
+   title = "A Rational Approach to Portable High Performance: The Basic Linear Algebra Instruction Set ({BLAIS}) and the Fixed Algorithm Size Template ({FAST}) Library",
+   author = "Jeremy G. Siek and Andrew Lumsdaine",
+   bibdate = "2002-08-04",
+   bibsource = "DBLP, http://dblp.uni-trier.de/db/conf/ecoopw/ecoopw98.html#SiekL98a",
+   booktitle = "Object-Oriented Technology, {ECOOP}'98 Workshop Reader, {ECOOP}'98 Workshops, Demos, and Posters, Brussels, Belgium, July 20-24, 1998, Proceedings",
+   publisher = "Springer",
+   year = "1998",
+   volume = "1543",
+   editor = "Serge Demeyer and Jan Bosch",
+   ISBN = "3-540-65460-7",
+   pages = "468--469",
+   series = "Lecture Notes in Computer Science",
+ }

  @InCollection{Hall05,
…
  pages = "490--503",
  series = "Lecture Notes in Computer Science",
- URL = "http://dx.doi.org/10.1007/978-3-540-71351-7_38",
  }
…
- @phdthesis{vuduc-thesis,
-   author = {Richard Wilson Vuduc},
-   note = {Chair-James W. Demmel},
-   title = {Automatic performance tuning of sparse matrix kernels},
-   year = {2003},
-   order_no = {AAI3121741},
-   publisher = {University of California, Berkeley},
- }
+ @PhDThesis{vuduc-thesis,
+   author = {Richard W. Vuduc},
+   title = {Automatic performance tuning of sparse matrix kernels},
+   school = {University of California, Berkeley},
+   month = {December},
+   year = {2003}
+ }

  @inproceedings{williams-sc07,
…
- @article{gprof,
-   author = {Susan L. Graham and Peter B. Kessler and Marshall K. McKusick},
-   title = {Gprof: {A} call graph execution profiler},
-   journal = {SIGPLAN Not.},
-   volume = {39},
-   number = {4},
-   year = {2004},
-   issn = {0362-1340},
-   pages = {49--57},
-   doi = {http://doi.acm.org/10.1145/989393.989401},
-   publisher = {ACM},
-   address = {New York, NY, USA},
- }
+ @Article{gprof,
+   author = "S. L. Graham and P. B. Kessler and M. K. McKusick",
+   title = "gprof: {A} Call Graph Execution Profiler",
+   journal = "SIGPLAN Notices",
+   volume = "17",
+   number = "6",
+   pages = "120",
+   month = jun,
+   year = "1982",
+   keywords = "UNIX utility performance",
+ }

orio/doc/papers/ipdps09/paper.tex
(r433 → r451)

  many productive discussions and his valuable help with Pluto. This
  work was supported in part by the U.S. Dept. of Energy under Contract
- DE-AC02-06CH11357.
+ DE-AC02-06CH11357, the National Science Foundation through awards
+ 0403342, 0509467, and 0811781, and a State of Ohio Development Fund.

  \small

orio/doc/papers/ipdps09/related.tex
(r437 → r451)

  digital signal processing libraries by an extensive empirical search
  over implementation variants. GotoBLAS~\cite{Goto:2006fk,Goto:fk}, on
- the other hand, achieves great performance results on several
+ the other hand, achieves near-peak performance on several
  architectures by using hand-tuned data structures and kernel
  operations. These auto- or hand-tuned approaches can deliver
…
  and thus the interfaces are not necessarily based on concepts accessible to
  computational scientists. The complexity of existing annotation languages and lack
- of common syntaxes for transformations (e.g., loop unrolling) result
+ of common syntax for transformations (e.g., loop unrolling) result
  in steep learning curves and the inability to take advantage of more than one
  approach at a time. Furthermore, at present, there is no good way for

orio/doc/papers/ipdps09/results.tex
(r444 → r451)

  on the Xeon and Blue Gene/P, respectively. Because of space considerations,
  here we discuss a limited number of computational kernels; more performance
- results for different computations are available at the Orio project
+ data for different computations is available at the Orio project
  website~\cite{OrioURL}.
…
  The SpMV operation computes $\forall{A_{i,j}} \neq 0:y_{i}
  \leftarrow y_{i} + A_{i,j} \cdot x_{j}$, where $A$ is a sparse matrix,
- and $x$, $y$ are dense vectors. Each element of $A$ is used precisely once,
+ and $x$, $y$ are dense vectors. Each element of $A$ is used only once,
  and element reuse is possible only for $x$ and $y$. Thus, to optimize SpMV,
  one should use compact data structures for $A$ and try to maximize temporal
…
  facilitate OpenMP parallelization. The SMP and VN results show that
  optimizations using loop-level parallelism (through OpenMP) achieve much
- better performance than using MPI coarse-grained parallelism.
+ better performance than using coarse-grained MPI parallelism.
  %Moreover, for the multithread cases, tuning using Orio determined that
  %the performance benefit of SIMDization is more significant than that of