Optimal Code Generation for Explicitly Parallel Processors
Optimal Code Generation for Explicitly Parallel Processors
Disciplines
Computer Sciences (80%); Mathematics (20%)
Keywords
-
Compiler,
Instruction Level Parallelism,
VLIW,
Integer Linear Programming,
Scheduling,
Combinatorial Optimization
Embedded system designers today are faced with computational workloads that have been state-of-the-art in the supercomputing domain not long ago. However, normalized to the same technology, a 2 - 3x speedup with traditional techniques very roughly requires about an increase of 80x in area and, maybe even more important, about 12x in power consumption - a cost too high to bear for most pervasive mobile applications. A popular approach among embedded systems designers are VLIW architectures employing a large number of parallel datapaths and functional units as well as application specific instruction set extensions that are taylored to a particular application. Those systems rely on highly optimizing compilers that apply aggressive code transformations in order to acquire sufficient amounts of parallelism. However, ILP-oriented optimizations impose complex trade-offs that often cannot be accomplished by traditional heuristic approaches. Additionally, application- specific architectural restrictions and asymmetries challenge the development of compilers in several ways. First, the large number of application-specific architectural variants imposes a major engineering problem that is hard to cope with using traditional compiler backends. Second, heuristic techniques often fail to effectively exploit those architectural features, thus resulting in inefficient code. Techniques based on combinatorial optimization such as integer linear programming have the potential to overcome those deficiencies. Mathematical formulations allow for combined optimization of several subproblems and are capable to formulate complex architectural constraints that can be automatically derived from abstract machine descriptions. However, previously proposed approaches have achieved little relevance in practice. The main reason is that several arising subproblems are known to be NP complete in general and naive formulations inevitably lead to unacceptable compile time. The aim of this project is to develop new algorithms and mathematical formulations that maintain the advantages of integer linear programming based code generation techniques while remaining computational feasible for real- world programs. This includes the application of well-known techniques from the operations research domain to decrease the required solver time such as cutting plane algorithms, column generation techniques, or Lagrangian relaxation. As some of the subproblems are known to be computationally hard for real-world instances, we want to use the developed models also to learn when and why established heuristics fail, and to develop efficient approximation algorithms and near-optimal techniques that remain computationally feasible even for large problems. Our focus is on phase ordering issues among scheduling and register allocation, integrated scheduling of cyclic regions (software pipelining), and support for irregular architectural constraints and asymmetric datapaths. Instead of fully integrated techniques, we propose a novel compilation scheme that decomposes register allocation into several subtasks that can be solved separately. This approach allows to keep the scheduling formulation comparatively small by maintaining only simple additional invariants.
The focus of this project was on research for optimal and near optimal code generation for explicitly parallel processors. Explicitly parallel processors can execute more than one instruction every cycle, but the programmer (or the compiler) has to explicitly specify which instructions are independent from each other and can be executed in parallel in the same cycle. These processors are mostly used in embedded applications like smart phones and often have hardware constraints like clustered functional units and register files, predicated execution or rotating register files. Code generation is the final machine dependent part of a compiler and contains passes like instruction selection, register allocation, instruction scheduling, software pipelining, if-conversion and cluster assignment. Finding optimal solutions for these passes is often computationally intractable. Furthermore, these passes are usually executed independently from each other resulting in non optimal code. The results of this project were new heuristic and exact methods for optimal and near optimal integrated code generation. Heuristic methods give nearly as good results as optimal methods and are fast enough to be incorporated in production compilers.
- Technische Universität Wien - 100%
Research Output
- 44 Citations
- 14 Publications
-
2014
Title Integrated modulo scheduling and cluster assignment for TI TMS320C64x+ architecture DOI 10.1145/2568326.2568327 Type Conference Proceeding Abstract Author Kim N Pages 25-32 Link Publication -
2011
Title Fast JIT code generation for x86-64 with LLVM. Type Conference Proceeding Abstract Author Krall A Conference ACACES 2011 Poster Abstracts, 7, Fiuggi, Italy, 2011. HiPEAC. Posterpräsentation: ACACES 2011, Fiuggi, Italien; 2011-07-10 - 2011-07-16 -
2013
Title Static analysis of worst-case stack cache behavior DOI 10.1145/2516821.2516828 Type Conference Proceeding Abstract Author Jordan A Pages 55-64 -
2010
Title Graph-based cluster assignment for VLIW architectures. Type Conference Proceeding Abstract Author Jordan A Conference Posterpräsentation: 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), Wien; 2010-09-11 - 2010-09-15 -
2010
Title Optimistic integrated instruction scheduling and register allocation. Type Conference Proceeding Abstract Author Barany G Conference Proceedings of the Junior Scientist Conference 2010; Posterpräsentation: Junior Scientist Conference 2010, Vienna; 2010-04-07 - 2010-04-09 -
2010
Title Optimistic integrated instruction scheduling and register allocation. Type Conference Proceeding Abstract Author Barany G Conference 15th Workshop on Compilers for Parallel Computing (CPC 2010) -
2010
Title Optimal and heuristic code generation for explicitly parallel processors. Type Conference Proceeding Abstract Author Barany G Conference ACACES 2010 Poster Abstracts. HiPEAC, 2010. Posterpräsentation: ACACES 2010, Terrassa (Barcelona), Spanien; 2010-07-11 - 2010-07-17 -
2014
Title Refinement of worst-case execution time bounds by graph pruning DOI 10.1016/j.cl.2014.09.001 Type Journal Article Author Brandner F Journal Computer Languages, Systems & Structures Pages 155-170 -
2014
Title Computation of alias sets from shape graphs for comparison of shape analysis precision DOI 10.1049/iet-sen.2012.0049 Type Journal Article Author Pavlu V Journal IET Software Pages 120-133 Link Publication -
2014
Title Evaluating and estimating the WCET criticality metric DOI 10.1145/2568326.2568331 Type Conference Proceeding Abstract Author Jordan A Pages 11-18 -
2013
Title IR-level versus machine-level if-conversion for predicated architectures DOI 10.1145/2443608.2443611 Type Conference Proceeding Abstract Author Jordan A Pages 3-10 Link Publication -
2013
Title Optimal and Heuristic Global Code Motion for Minimal Spilling DOI 10.1007/978-3-642-37051-9_2 Type Book Chapter Author Barany G Publisher Springer Nature Pages 21-40 Link Publication -
2012
Title Static profiling of the worst-case in real-time programs DOI 10.1145/2392987.2393000 Type Conference Proceeding Abstract Author Brandner F Pages 101-110 -
2011
Title Register reuse scheduling. Type Conference Proceeding Abstract Author Barany G Conference 9th Workshop on Optimizations for DSP and Embedded Systems (ODES'11)