Changeset 451

Timestamp: 02/15/09 23:38:08
Author: hartono
Message: --
Files:

  • orio/doc/papers/ipdps09/abstract.tex

    r427 r451

    -In many scientific applications, significant time is spent in tuning
    +For many scientific applications, significant time is spent in tuning
     codes for a particular high-performance architecture. Tuning
     approaches range from the relatively nonintrusive (e.g., by using
     compiler options) to extensive code modifications that attempt to
     exploit specific architecture features. Intrusive techniques often
    -result in code changes that are not easily reversible, which can
    +result in code changes that are not easily reversible, and can
     negatively impact readability, maintainability, and performance on
     different architectures. We introduce an extensible annotation-based
    -empirical tuning system called Orio, which is aimed at improving both
    -performance and productivity by enabling software developers to insert
    +empirical tuning system called Orio that is aimed at improving both
    +performance and productivity. It allows software developers to insert
     annotations in the form of structured comments into their source code
    -that trigger a number of low-level performance optimizations on a
    +to trigger a number of low-level performance optimizations on a
     specified code fragment.  To maximize the performance tuning
     opportunities, the annotation processing infrastructure is designed to
  • orio/doc/papers/ipdps09/implementation.tex

    r443 r451

     as much as possible).
     %
    -Finally, Orio is usable in real scientific applications without
    +Finally, Orio is usable with real scientific applications without
     requiring reimplementation. This ensures that the significant investments in
     the development of complex scientific codes is leveraged to the greatest

     for each distinct combination of performance parameter values. Then each
     optimized code version is executed and its performance evaluated.  After
    -iteratively evaluating all code variants, the best-performing code is picked
    +iteratively evaluating all code variants, the best-performing code is selected
     as the final output of Orio. Because the search space of all possible
     optimized code versions can be huge, a brute force search strategy is not

     Orio annotations are embedded into the source code as structured C comments
    -that always start with \texttt{/*@} and end with \texttt{@*/}. For example,
    +that start with \texttt{/*@} and end with \texttt{@*/}. For example,
     the annotation \texttt{/*@ end @*/} is used to indicate the end of an
     annotated code region. A simple grammar illustrating the basic syntax of Orio
     annotations is depicted in Figure~\ref{fig:ann-lang}. An annotation region
     consists of three main parts: leader annotation, annotation body, and trailer
    -annotation. The annotation body can be either empty or contain C code that
    +annotation. The annotation body can either be empty or contain C code that
     possibly includes other nested annotation regions. A leader annotation
     contains the name of the code transformation module used to transform and
     

     Current optimizations supported by Orio span different types of code
    -transformations that are not provided by production compilers in some
    +transformation that are not provided by production compilers in some
     cases. Available optimizations include simple loop unrolling, memory
     alignment optimization, loop unroll/jamming, loop tiling, loop permutation,
  • orio/doc/papers/ipdps09/integration.tex

    r443 r451

     optimization parameters code. The loop code is embedded in the
     annotation body block, while the performance parameters code is
    -written in the module body block. A simple parser implemented inside
    +written in the module body block. A parser implemented inside
     the new module parses the performance parameters code to extract their
     values. The extracted values of the performance parameters include

     described in Section~\ref{sec:implm}.

    -Pluto also performs syntactic loop unroll/jamming as a post-processing
    +Pluto also performs syntactic loop unroll/jam in a post-processing
     pass. The target loops, however, are limited only to innermost loops with a
     maximum depth of two (i.e., 1-D unrolling and 2-D unroll/jamming). So we
    -choose to use the loop unroll/jamming transformation already available in
    +choose to use the loop unroll/jam transformation already available in
     Orio, which is more flexible because it can be applied to loop nests of
     depths larger than two.
     


    -At present Orio does not employ any data dependency analysis when performing
    +At present Orio does not employ any data dependence analysis when performing
     syntactic transformations. Therefore, there is no guarantee that the code
     generated after syntactic transformations is correct. Because of this, the
  • orio/doc/papers/ipdps09/motivation.tex

    r428 r451

     loop-level optimizations. Manual tuning is time-consuming and impedes
     readability and performance portability. Tuned libraries often deliver
    -great performance without requiring significant programming effort,
    +excellent performance without requiring significant programming effort,
     but can only provide limited functionality. General-purpose source
     transformation tools for performance optimizations are few and have
  • orio/doc/papers/ipdps09/notes.txt

    r445 r451

     @INPROCEEDINGS{Hart0000:Orio,
     AUTHOR="Albert Hartono and Boyana Norris and Ponnuswamy Sadayappan",
    -TITLE="Annotation-Based Empirical Performance Tuning Using Orio",
    +TITLE="Annotation-Based Empirical Performance Tuning Using {Orio}",
     BOOKTITLE="23rd IEEE International Parallel \& Distributed Processing Symposium",
     ADDRESS="Rome, Italy",
     KEYWORDS="empirical performance tuning, code generation",
    -ABSTRACT="In many scientific applications, significant time is spent in tuning codes
    -for a particular high-performance architecture. Tuning approaches range
    -from the relatively nonintrusive (e.g., by using compiler options) to
    -extensive code modifications that attempt to exploit specific architecture
    -features. Intrusive techniques often result in code changes that are not
    -easily reversible, which can negatively impact readability,
    -maintainability, and performance on different architectures. We introduce
    -an extensible annotation-based empirical tuning system called Orio, which
    -is aimed at improving both performance and productivity by enabling
    -software developers to insert annotations in the form of structured
    -comments into their source code that trigger a number of low-level
    -performance optimizations on a specified code fragment.  To maximize the
    -performance tuning opportunities, the annotation processing infrastructure
    -is designed to support both architecture-independent and
    -architecture-specific code optimizations. Given the annotated code as
    -input, Orio generates many tuned versions of the same operation and
    -empirically evaluates the alternatives to select the best performing
    -version for production use. We have also enabled the use of the Pluto
    -automatic parallelization tool in conjunction with Orio to generate
    -efficient OpenMP-based parallel code. We describe our experimental results
    +ABSTRACT="For many scientific applications, significant time is spent in tuning
    +codes for a particular high-performance architecture. Tuning
    +approaches range from the relatively nonintrusive (e.g., by using
    +compiler options) to extensive code modifications that attempt to
    +exploit specific architecture features. Intrusive techniques often
    +result in code changes that are not easily reversible, and can
    +negatively impact readability, maintainability, and performance on
    +different architectures. We introduce an extensible annotation-based
    +empirical tuning system called Orio that is aimed at improving both
    +performance and productivity. It allows software developers to insert
    +annotations in the form of structured comments into their source code
    +to trigger a number of low-level performance optimizations on a
    +specified code fragment.  To maximize the performance tuning
    +opportunities, the annotation processing infrastructure is designed to
    +support both architecture-independent and architecture-specific code
    +optimizations. Given the annotated code as input, Orio generates many
    +tuned versions of the same operation and empirically evaluates the
    +alternatives to select the best performing version for production
    +use. We have also enabled the use of the Pluto automatic
    +parallelization tool in conjunction with Orio to generate efficient
    +OpenMP-based parallel code. We describe our experimental results
     involving a number of computational kernels, including dense array and
    -sparse matrix operations."
    -
    +sparse matrix operations.}

  • orio/doc/papers/ipdps09/paper.bib

    r449 r451

     PAGES          = {308-323} }

    -@BOOK{laug,
    -AUTHOR         = {E. Anderson and Z. Bai and C. Bischof and J. Demmel and
    -                 J. Dongarra and J. Du Croz and A. Greenbaum and S. Hammarling
    -                 and A. McKenney and S. Ostrouchov and D. Sorensen},
    -TITLE          = {{LAPACK} Users' Guide},
    -EDITION        = {Second},
    -PUBLISHER      = siam,
    -YEAR           = {1995},
    -ADDRESS        = {Philadelphia, PA} }
    +@book{laug,
    + author ={Anderson,, E. and Bai,, Z. and Bischof,, C. and Demmel,, J. and Dongarra,, J. and Du Croz,, J. and Greenbaum,, A. and Hammarling,, S. and McKenney,, A. and Ostrouchov,, S. and Sorensen,, D.},
    + title ={LAPACK's user's guide},
    + year ={1992},
    + isbn ={0-89871-294-7},
    + publisher ={Society for Industrial and Applied Mathematics},
    + address ={Philadelphia, PA, USA},
    + }

     @TECHREPORT{lawn111,

        author = {Ken Kennedy and others},
        title = {Telescoping Languages Project Description},
    -   howpublished = {\url{telescoping.rice.edu}},
    +   howpublished = {\url{http://telescoping.rice.edu/}},
     year = 2006
     }

       title = {Telescoping Languages: A System for Automatic Generation of Domain Languages},
       year = {2005},
    -  URL = {http://ieeexplore.ieee.org/xpl/abs\_free.jsp?arNumber=1386658 },
       abstract = { The software gap - the discrepancy between the need for new software and the aggregate capacity of the workforce to produce it - is a serious problem for scientific software. Although users appreciate the convenience (and, thus, improved productivity) of using relatively high-level scripting languages, the slow execution speeds of these languages remain a problem. Lower level languages, such as C and Fortran, provide better performance for production applications, but at the cost of tedious programming and optimization by experts. If applications written in scripting languages could be routinely compiled into highly optimized machine code, a huge productivity advantage would be possible. It is not enough, however, to simply develop excellent compiler technologies for scripting languages (as a number of projects have succeeded in doing for MATLAB). In practice, scientists typically extend these languages with their own domain-centric components, such as the MATLAB signal processing toolbox. Doing so effectively defines a new domain-specific language. If we are to address efficiency problems for such extended languages, we must develop a framework for automatically generating optimizing compilers for them. To accomplish this goal, we have been pursuing an innovative strategy that we call telescoping languages. Our approach calls for using a library-preprocessing phase to extensively analyze and optimize collections of libraries that define an extended language. Results of this analysis are collected into annotated libraries and used to generate a library-aware optimizer. The generated library-aware optimizer uses the knowledge gathered during preprocessing to carry out fast and effective optimization of high-level scripts. This enables script optimization to benefit from the intense analysis performed during preprocessing without repaying its price. Since library preprocessing is performed only at infrequent "language-generation" times, its cost is amortized over many compilations of individual scripts that use the library. We call this strategy "telescoping languages" because it merges knowledge of a hierarchy of extended languages into a single library-aware optimizer. We present our vision and plans for compiler frameworks based on telescoping languages and - report on the preliminary research that has established the effectiveness of this approach. },
       journal = {Proceedings of the IEEE},

       pages =        {156--165},
       year =         {1993},
    -  url = "citeseer.ist.psu.edu/weise93programmable.html"
     }

         pages = "340--347",
         year = "1997",
    -    url = "citeseer.ist.psu.edu/article/bilmes97optimizing.html"
    +}

     @misc{ vuduc03memory,

     }

    -@InProceedings{POET,
    +@inproceedings{POET,
    +author = {Q. Yi and K. Seymour and H. You and R. Vuduc and D. Quinlan},
    +booktitle = {Workshop on Performance Optimization of High-Level Languages and Libraries (POHLL)},
    +title = {{POET}: Parameterized Optimizations for Empirical Tuning},
    +   publisher={IEEE Computer Society},
    +   pages = {1--8},
    +   location = {Long Beach, CA},
    +month = {March},
    +year = {2007}
    +}
    +
    +
    +@InProceedings{POET1,
        author = {Qing Yi and Keith Seymour and Haihang You and Richard Vuduc and Dan Quinlan},
        title = {{POET}: Parameterized Optimizations for Empirical Tuning},
        booktitle = {Proceedings of the Parallel and Distributed Processing Symposium, 2007},
        year = {2007},
    -   publisher={IEEE},
    +   publisher={IEEE Computer Society},
        pages = {1--8},
    -   DOI={10.1109/IPDPS.2007.370637},
        location = {Long Beach, CA},
        month = {Mar}



    -@inproceedings{Siek:1998ys,
    -        Author = {Jeremy G. Siek and Andrew Lumsdaine},
    -        Booktitle = {Parallel Object Oriented Scientific Computing},
    -        Date-Added = {2006-09-26 11:13:25 -0600},
    -        Date-Modified = {2007-05-18 13:42:22 -0600},
    -        Organization = {ECOOP},
    -        Title = {A Rational Approach to Portable High Performance: The {Basic Linear Algebra Instruction Set (BLAS)}
    -and the {Fixed Algorithm Size Template (FAST)} Library},
    -        Year = 1998}
    +@InProceedings{Siek:1998ys,
    +  title ="A Rational Approach to Portable High Performance: The
    +Basic Linear Algebra Instruction Set ({BLAIS}) and the
    +Fixed Algorithm Size Template ({FAST}) Library",
    +  author ="Jeremy G. Siek and Andrew Lumsdaine",
    +  bibdate ="2002-08-04",
    +  bibsource ="DBLP, http://dblp.uni-trier.de/db/conf/ecoopw/ecoopw98.html#SiekL98a",
    +  booktitle ="ECOOP Workshops",
    +  booktitle ="Object-Oriented Technology, {ECOOP}'98 Workshop
    + Reader, {ECOOP}'98 Workshops, Demos, and Posters,
    + Brussels, Belgium, July 20-24, 1998, Proceedings",
    +  publisher ="Springer",
    +  year = "1998",
    +  volume ="1543",
    +  editor ="Serge Demeyer and Jan Bosch",
    +  ISBN = "3-540-65460-7",
    +  pages ="468--469",
    +  series ="Lecture Notes in Computer Science",
    +}

     @InCollection{Hall05,

       pages =       "490--503",
       series =      "Lecture Notes in Computer Science",
    -  URL =         "http://dx.doi.org/10.1007/978-3-540-71351-7_38",
     }

     }

    -@phdthesis{vuduc-thesis,
    - author = {Richard Wilson Vuduc},
    - note = {Chair-James W. Demmel},
    - title = {Automatic performance tuning of sparse matrix kernels},
    - year = {2003},
    - order_no = {AAI3121741},
    - publisher = {University of California, Berkeley},
    - }
    +@PhDThesis{vuduc-thesis,
    +author = {Richard W. Vuduc},
    +title = {Automatic performance tuning of sparse matrix kernels},
    +school = {University of California, Berkeley},
    +month = {December},
    +year = {2003}
    +}

     @inproceedings{williams-sc07,

     }

    -@article{gprof,
    - author = {Susan L. Graham and Peter B. Kessler and Marshall K. McKusick},
    - title = {Gprof: {A} call graph execution profiler},
    - journal = {SIGPLAN Not.},
    - volume = {39},
    - number = {4},
    - year = {2004},
    - issn = {0362-1340},
    - pages = {49--57},
    - doi = {http://doi.acm.org/10.1145/989393.989401},
    - publisher = {ACM},
    - address = {New York, NY, USA},
    - }
    +@Article{gprof,
    +  author ="S. L. Graham and P. B. Kessler and M. K. McKusick",
    +  title ="gprof: {A} Call Graph Execution Profiler",
    +  journal ="SIGPLAN Notices",
    +  volume ="17",
    +  number ="6",
    +  pages ="120",
    +  month =jun,
    +  year ="1982",
    +  keywords ="UNIX utility performance",
    +}

  • orio/doc/papers/ipdps09/paper.tex

    r433 r451

     many productive discussions and his valuable help with Pluto. This
     work was supported in part by the U.S. Dept. of Energy under Contract
    -DE-AC02-06CH11357.
    +DE-AC02-06CH11357, the National Science Foundation through awards
    +0403342, 0509467, and 0811781, and a State of Ohio Development Fund.

     \small
  • orio/doc/papers/ipdps09/related.tex

    r437 r451

     digital signal processing libraries by an extensive empirical search
     over implementation variants.  GotoBLAS~\cite{Goto:2006fk,Goto:fk}, on
    -the other hand, achieves great performance results on several
    +the other hand, achieves near-peak performance on several
     architectures by using hand-tuned data structures and kernel
     operations.  These auto- or hand-tuned approaches can deliver

     and thus the interfaces are not necessarily based on concepts accessible to
     computational scientists. The complexity of existing annotation languages and lack
    -of common syntaxes for transformations (e.g., loop unrolling) result
    +of common syntax for transformations (e.g., loop unrolling) result
     in steep learning curves and the inability to take advantage of more than one
     approach at a time. Furthermore, at present, there is no good way for
  • orio/doc/papers/ipdps09/results.tex

    r444 r451

     on the Xeon and Blue Gene/P, respectively. Because of space considerations,
     here we discuss a limited number of computational kernels; more performance
    -results for different computations are available at the Orio project
    +data for different computations is available at the Orio project
     website~\cite{OrioURL}.


     The SpMV operation computes $\forall{A_{i,j}} \neq 0:y_{i}
     \leftarrow y_{i} + A_{i,j} \cdot x_{j}$, where $A$ is a sparse matrix,
    -and $x$, $y$ are dense vectors. Each element of $A$ is used precisely once,
    +and $x$, $y$ are dense vectors. Each element of $A$ is used only once,
     and element reuse is possible only for $x$ and $y$. Thus, to optimize SpMV,
     one should use compact data structures for $A$ and try to maximize temporal
     
     facilitate OpenMP parallelization. The SMP and VN results show that
     optimizations using loop-level parallelism (through OpenMP) achieve much
    -better performance than using MPI coarse-grained parallelism.
    +better performance than using coarse-grained MPI parallelism.
     %Moreover, for the multithread cases, tuning using Orio determined that
     %the performance benefit of SIMDization is more significant than that of