Changeset 451

Timestamp: 02/15/09 23:38:08
Author: hartono
Message: --
Files:

  • orio/doc/papers/ipdps09/abstract.tex

    r427 r451

    -In many scientific applications, significant time is spent in tuning
    +For many scientific applications, significant time is spent in tuning
     codes for a particular high-performance architecture. Tuning
     approaches range from the relatively nonintrusive (e.g., by using
     compiler options) to extensive code modifications that attempt to
     exploit specific architecture features. Intrusive techniques often
    -result in code changes that are not easily reversible, which can
    +result in code changes that are not easily reversible, and can
     negatively impact readability, maintainability, and performance on
     different architectures. We introduce an extensible annotation-based
    -empirical tuning system called Orio, which is aimed at improving both
    -performance and productivity by enabling software developers to insert
    +empirical tuning system called Orio that is aimed at improving both
    +performance and productivity. It allows software developers to insert
     annotations in the form of structured comments into their source code
    -that trigger a number of low-level performance optimizations on a
    +to trigger a number of low-level performance optimizations on a
     specified code fragment.  To maximize the performance tuning
     opportunities, the annotation processing infrastructure is designed to
  • orio/doc/papers/ipdps09/implementation.tex

    r443 r451

     as much as possible).
     %
    -Finally, Orio is usable in real scientific applications without
    +Finally, Orio is usable with real scientific applications without
     requiring reimplementation. This ensures that the significant investments in
     the development of complex scientific codes is leveraged to the greatest

     for each distinct combination of performance parameter values. Then each
     optimized code version is executed and its performance evaluated.  After
    -iteratively evaluating all code variants, the best-performing code is picked
    +iteratively evaluating all code variants, the best-performing code is selected
     as the final output of Orio. Because the search space of all possible
     optimized code versions can be huge, a brute force search strategy is not

     Orio annotations are embedded into the source code as structured C comments
    -that always start with \texttt{/*@} and end with \texttt{@*/}. For example,
    +that start with \texttt{/*@} and end with \texttt{@*/}. For example,
     the annotation \texttt{/*@ end @*/} is used to indicate the end of an
     annotated code region. A simple grammar illustrating the basic syntax of Orio
     annotations is depicted in Figure~\ref{fig:ann-lang}. An annotation region
     consists of three main parts: leader annotation, annotation body, and trailer
    -annotation. The annotation body can be either empty or contain C code that
    +annotation. The annotation body can either be empty or contain C code that
     possibly includes other nested annotation regions. A leader annotation
     contains the name of the code transformation module used to transform and
     

     Current optimizations supported by Orio span different types of code
    -transformations that are not provided by production compilers in some
    +transformation that are not provided by production compilers in some
     cases. Available optimizations include simple loop unrolling, memory
     alignment optimization, loop unroll/jamming, loop tiling, loop permutation,
  • orio/doc/papers/ipdps09/integration.tex

    r443 r451

     optimization parameters code. The loop code is embedded in the
     annotation body block, while the performance parameters code is
    -written in the module body block. A simple parser implemented inside
    +written in the module body block. A parser implemented inside
     the new module parses the performance parameters code to extract their
     values. The extracted values of the performance parameters include

     described in Section~\ref{sec:implm}.

    -Pluto also performs syntactic loop unroll/jamming as a post-processing
    +Pluto also performs syntactic loop unroll/jam in a post-processing
     pass. The target loops, however, are limited only to innermost loops with a
     maximum depth of two (i.e., 1-D unrolling and 2-D unroll/jamming). So we
    -choose to use the loop unroll/jamming transformation already available in
    +choose to use the loop unroll/jam transformation already available in
     Orio, which is more flexible because it can be applied to loop nests of
     depths larger than two.
     


    -At present Orio does not employ any data dependency analysis when performing
    +At present Orio does not employ any data dependence analysis when performing
     syntactic transformations. Therefore, there is no guarantee that the code
     generated after syntactic transformations is correct. Because of this, the
  • orio/doc/papers/ipdps09/motivation.tex

    r428 r451

     loop-level optimizations. Manual tuning is time-consuming and impedes
     readability and performance portability. Tuned libraries often deliver
    -great performance without requiring significant programming effort,
    +excellent performance without requiring significant programming effort,
     but can only provide limited functionality. General-purpose source
     transformation tools for performance optimizations are few and have
  • orio/doc/papers/ipdps09/notes.txt

    r445 r451

     @INPROCEEDINGS{Hart0000:Orio,
     AUTHOR="Albert Hartono and Boyana Norris and Ponnuswamy Sadayappan",
    -TITLE="Annotation-Based Empirical Performance Tuning Using Orio",
    +TITLE="Annotation-Based Empirical Performance Tuning Using {Orio}",
     BOOKTITLE="23rd IEEE International Parallel \& Distributed Processing Symposium",
     ADDRESS="Rome, Italy",
     KEYWORDS="empirical performance tuning, code generation",
    -ABSTRACT="In many scientific applications, significant time is spent in tuning codes
    -for a particular high-performance architecture. Tuning approaches range
    -from the relatively nonintrusive (e.g., by using compiler options) to
    -extensive code modifications that attempt to exploit specific architecture
    -features. Intrusive techniques often result in code changes that are not
    -easily reversible, which can negatively impact readability,
    -maintainability, and performance on different architectures. We introduce
    -an extensible annotation-based empirical tuning system called Orio, which
    -is aimed at improving both performance and productivity by enabling
    -software developers to insert annotations in the form of structured
    -comments into their source code that trigger a number of low-level
    -performance optimizations on a specified code fragment.  To maximize the
    -performance tuning opportunities, the annotation processing infrastructure
    -is designed to support both architecture-independent and
    -architecture-specific code optimizations. Given the annotated code as
    -input, Orio generates many tuned versions of the same operation and
    -empirically evaluates the alternatives to select the best performing
    -version for production use. We have also enabled the use of the Pluto
    -automatic parallelization tool in conjunction with Orio to generate
    -efficient OpenMP-based parallel code. We describe our experimental results
    +ABSTRACT="For many scientific applications, significant time is spent in tuning
    +codes for a particular high-performance architecture. Tuning
    +approaches range from the relatively nonintrusive (e.g., by using
    +compiler options) to extensive code modifications that attempt to
    +exploit specific architecture features. Intrusive techniques often
    +result in code changes that are not easily reversible, and can
    +negatively impact readability, maintainability, and performance on
    +different architectures. We introduce an extensible annotation-based
    +empirical tuning system called Orio that is aimed at improving both
    +performance and productivity. It allows software developers to insert
    +annotations in the form of structured comments into their source code
    +to trigger a number of low-level performance optimizations on a
    +specified code fragment.  To maximize the performance tuning
    +opportunities, the annotation processing infrastructure is designed to
    +support both architecture-independent and architecture-specific code
    +optimizations. Given the annotated code as input, Orio generates many
    +tuned versions of the same operation and empirically evaluates the
    +alternatives to select the best performing version for production
    +use. We have also enabled the use of the Pluto automatic
    +parallelization tool in conjunction with Orio to generate efficient
    +OpenMP-based parallel code. We describe our experimental results
     involving a number of computational kernels, including dense array and
    -sparse matrix operations."
    -
    +sparse matrix operations.}

  • orio/doc/papers/ipdps09/paper.bib

    r449 r451

     PAGES          = {308-323} }

    -@BOOK{laug,
    -AUTHOR         = {E. Anderson and Z. Bai and C. Bischof and J. Demmel and
    -                 J. Dongarra and J. Du Croz and A. Greenbaum and S. Hammarling
    -                 and A. McKenney and S. Ostrouchov and D. Sorensen},
    -TITLE          = {{LAPACK} Users' Guide},
    -EDITION        = {Second},
    -PUBLISHER      = siam,
    -YEAR           = {1995},
    -ADDRESS        = {Philadelphia, PA} }
    +@book{laug,
    + author ={Anderson,, E. and Bai,, Z. and Bischof,, C. and Demmel,, J. and Dongarra,, J. and Du Croz,, J. and Greenbaum,, A. and Hammarling,, S. and McKenney,, A. and Ostrouchov,, S. and Sorensen,, D.},
    + title ={LAPACK's user's guide},
    + year ={1992},
    + isbn ={0-89871-294-7},
    + publisher ={Society for Industrial and Applied Mathematics},
    + address ={Philadelphia, PA, USA},
    + }

     @TECHREPORT{lawn111,

        author = {Ken Kennedy and others},
        title = {Telescoping Languages Project Description},
    -   howpublished = {\url{telescoping.rice.edu}},
    +   howpublished = {\url{http://telescoping.rice.edu/}},
     year = 2006
     }

       title = {Telescoping Languages: A System for Automatic Generation of Domain Languages},
       year = {2005},
    -  URL = {http://ieeexplore.ieee.org/xpl/abs\_free.jsp?arNumber=1386658 },
       abstract = { The software gap - the discrepancy between the need for new software and the aggregate capacity of the workforce to produce it - is a serious problem for scientific software. Although users appreciate the convenience (and, thus, improved productivity) of using relatively high-level scripting languages, the slow execution speeds of these languages remain a problem. Lower level languages, such as C and Fortran, provide better performance for production applications, but at the cost of tedious programming and optimization by experts. If applications written in scripting languages could be routinely compiled into highly optimized machine code, a huge productivity advantage would be possible. It is not enough, however, to simply develop excellent compiler technologies for scripting languages (as a number of projects have succeeded in doing for MATLAB). In practice, scientists typically extend these languages with their own domain-centric components, such as the MATLAB signal processing toolbox. Doing so effectively defines a new domain-specific language. If we are to address efficiency problems for such extended languages, we must develop a framework for automatically generating optimizing compilers for them. To accomplish this goal, we have been pursuing an innovative strategy that we call telescoping languages. Our approach calls for using a library-preprocessing phase to extensively analyze and optimize collections of libraries that define an extended language. Results of this analysis are collected into annotated libraries and used to generate a library-aware optimizer. The generated library-aware optimizer uses the knowledge gathered during preprocessing to carry out fast and effective optimization of high-level scripts. This enables script optimization to benefit from the intense analysis performed during preprocessing without repaying its price. Since library preprocessing is performed only at infrequent "language-generation" times, its cost is amortized over many compilations of individual scripts that use the library. We call this strategy "telescoping languages" because it merges knowledge of a hierarchy of extended languages into a single library-aware optimizer. We present our vision and plans for compiler frameworks based on telescoping languages and - report on the preliminary research that has established the effectiveness of this approach. },
       journal = {Proceedings of the IEEE},

       pages =        {156--165},
       year =         {1993},
    -  url = "citeseer.ist.psu.edu/weise93programmable.html"
     }

         pages = "340--347",
         year = "1997",
    -    url = "citeseer.ist.psu.edu/article/bilmes97optimizing.html"
    +}

     @misc{ vuduc03memory,

     }

    -@InProceedings{POET,
    +@inproceedings{POET,
    +author = {Q. Yi and K. Seymour and H. You and R. Vuduc and D. Quinlan},
    +booktitle = {Workshop on Performance Optimization of High-Level Languages and Libraries (POHLL)},
    +title = {{POET}: Parameterized Optimizations for Empirical Tuning},
    +   publisher={IEEE Computer Society},
    +   pages = {1--8},
    +   location = {Long Beach, CA},
    +month = {March},
    +year = {2007}
    +}
    +
    +
    +@InProceedings{POET1,
        author = {Qing Yi and Keith Seymour and Haihang You and Richard Vuduc and Dan Quinlan},
        title = {{POET}: Parameterized Optimizations for Empirical Tuning},
        booktitle = {Proceedings of the Parallel and Distributed Processing Symposium, 2007},
        year = {2007},
    -   publisher={IEEE},
    +   publisher={IEEE Computer Society},
        pages = {1--8},
    -   DOI={10.1109/IPDPS.2007.370637},
        location = {Long Beach, CA},
        month = {Mar}



    -@inproceedings{Siek:1998ys,
    -        Author = {Jeremy G. Siek and Andrew Lumsdaine},
    -        Booktitle = {Parallel Object Oriented Scientific Computing},
    -        Date-Added = {2006-09-26 11:13:25 -0600},
    -        Date-Modified = {2007-05-18 13:42:22 -0600},
    -        Organization = {ECOOP},
    -        Title = {A Rational Approach to Portable High Performance: The {Basic Linear Algebra Instruction Set (BLAS)}
    -and the {Fixed Algorithm Size Template (FAST)} Library},
    -        Year = 1998}
    +@InProceedings{Siek:1998ys,
    +  title ="A Rational Approach to Portable High Performance: The
    +Basic Linear Algebra Instruction Set ({BLAIS}) and the
    +Fixed Algorithm Size Template ({FAST}) Library",
    +  author ="Jeremy G. Siek and Andrew Lumsdaine",
    +  bibdate ="2002-08-04",
    +  bibsource ="DBLP, http://dblp.uni-trier.de/db/conf/ecoopw/ecoopw98.html#SiekL98a",
    +  booktitle ="ECOOP Workshops",
    +  booktitle ="Object-Oriented Technology, {ECOOP}'98 Workshop
    + Reader, {ECOOP}'98 Workshops, Demos, and Posters,
    + Brussels, Belgium, July 20-24, 1998, Proceedings",
    +  publisher ="Springer",
    +  year = "1998",
    +  volume ="1543",
    +  editor ="Serge Demeyer and Jan Bosch",
    +  ISBN = "3-540-65460-7",
    +  pages ="468--469",
    +  series ="Lecture Notes in Computer Science",
    +}

     @InCollection{Hall05,

       pages =       "490--503",
       series =      "Lecture Notes in Computer Science",
    -  URL =         "http://dx.doi.org/10.1007/978-3-540-71351-7_38",
     }

     }

    -@phdthesis{vuduc-thesis,
    - author = {Richard Wilson Vuduc},
    - note = {Chair-James W. Demmel},
    - title = {Automatic performance tuning of sparse matrix kernels},
    - year = {2003},
    - order_no = {AAI3121741},
    - publisher = {University of California, Berkeley},
    - }
    +@PhDThesis{vuduc-thesis,
    +author = {Richard W. Vuduc},
    +title = {Automatic performance tuning of sparse matrix kernels},
    +school = {University of California, Berkeley},
    +month = {December},
    +year = {2003}
    +}

     @inproceedings{williams-sc07,

     }

    -@article{gprof,
    - author = {Susan L. Graham and Peter B. Kessler and Marshall K. McKusick},
    - title = {Gprof: {A} call graph execution profiler},
    - journal = {SIGPLAN Not.},
    - volume = {39},
    - number = {4},
    - year = {2004},
    - issn = {0362-1340},
    - pages = {49--57},
    - doi = {http://doi.acm.org/10.1145/989393.989401},
    - publisher = {ACM},
    - address = {New York, NY, USA},
    - }
    +@Article{gprof,
    +  author ="S. L. Graham and P. B. Kessler and M. K. McKusick",
    +  title ="gprof: {A} Call Graph Execution Profiler",
    +  journal ="SIGPLAN Notices",
    +  volume ="17",
    +  number ="6",
    +  pages ="120",
    +  month =jun,
    +  year ="1982",
    +  keywords ="UNIX utility performance",
    +}

  • orio/doc/papers/ipdps09/paper.tex

    r433 r451

     many productive discussions and his valuable help with Pluto. This
     work was supported in part by the U.S. Dept. of Energy under Contract
    -DE-AC02-06CH11357.
    +DE-AC02-06CH11357, the National Science Foundation through awards
    +0403342, 0509467, and 0811781, and a State of Ohio Development Fund.

     \small
  • orio/doc/papers/ipdps09/related.tex

    r437 r451

     digital signal processing libraries by an extensive empirical search
     over implementation variants.  GotoBLAS~\cite{Goto:2006fk,Goto:fk}, on
    -the other hand, achieves great performance results on several
    +the other hand, achieves near-peak performance on several
     architectures by using hand-tuned data structures and kernel
     operations.  These auto- or hand-tuned approaches can deliver

     and thus the interfaces are not necessarily based on concepts accessible to
     computational scientists. The complexity of existing annotation languages and lack
    -of common syntaxes for transformations (e.g., loop unrolling) result
    +of common syntax for transformations (e.g., loop unrolling) result
     in steep learning curves and the inability to take advantage of more than one
     approach at a time. Furthermore, at present, there is no good way for
  • orio/doc/papers/ipdps09/results.tex

    r444 r451

     on the Xeon and Blue Gene/P, respectively. Because of space considerations,
     here we discuss a limited number of computational kernels; more performance
    -results for different computations are available at the Orio project
    +data for different computations is available at the Orio project
     website~\cite{OrioURL}.


     The SpMV operation computes $\forall{A_{i,j}} \neq 0:y_{i}
     \leftarrow y_{i} + A_{i,j} \cdot x_{j}$, where $A$ is a sparse matrix,
    -and $x$, $y$ are dense vectors. Each element of $A$ is used precisely once,
    +and $x$, $y$ are dense vectors. Each element of $A$ is used only once,
     and element reuse is possible only for $x$ and $y$. Thus, to optimize SpMV,
     one should use compact data structures for $A$ and try to maximize temporal
     
     facilitate OpenMP parallelization. The SMP and VN results show that
     optimizations using loop-level parallelism (through OpenMP) achieve much
    -better performance than using MPI coarse-grained parallelism.
    +better performance than using coarse-grained MPI parallelism.
     %Moreover, for the multithread cases, tuning using Orio determined that
     %the performance benefit of SIMDization is more significant than that of