Automatic Portable Performance for Heterogeneous Multi-cores
Disciplines
Computer Sciences (100%)
Keywords
Compiler, Parallel, Heterogeneous, Machine Learning, Modelling, Optimisation
Mapping program parallelism efficiently to heterogeneous multi-core processors is extremely challenging and highly dependent on the underlying architecture. The overall objective of this project is to investigate a novel parallel compiler approach that automatically learns how best to map program parallelism to heterogeneous multi-core platforms. Rather than hard-coding a compiler strategy for each parallel platform, we aim to explore an innovative, portable parallel compiler approach that self-adapts to any heterogeneous hardware and improves its performance over time. This is achieved by employing machine learning approaches which first learn the optimisation space off-line and then automatically derive a strategy that attempts to generate the "best" mapping for any user program. This predictive modelling approach can be further extended to on-line adaptation to manage contention for resources.

This project aims to explore multi-core system software that is "future scalable" and, if successful, will have a wide range of applications, from high-performance computing to desktops to embedded mobile devices. It will allow portable performance of parallel programs across platforms. Given the increasing prevalence of heterogeneous multi-cores, such an approach, if successful, will have international academic impact and be of significant benefit to EU industry.

Multi-core processors are commonplace and are widely seen as the most viable means of delivering performance with increasing transistor densities. Concurrent with this rise in multi-cores has been the adoption of specialised or customised cores such as general-purpose graphics processing units (GPGPUs). Unfortunately, there is no clear way for application software to harness this performance.
Current compiler technology, whose role is to map software to the underlying hardware, cannot do this on its own: it requires significant manual intervention from the programmer, who must tune the code for each system. As the number and diversity of cores increase, this software gap grows larger and will be the fundamental critical issue for computing systems in five years' time.
The efficient parallelization and optimization of computer programs on modern heterogeneous parallel computers is extremely challenging and highly dependent on the underlying architecture. The overall objective of this project was to investigate a novel parallel compiler approach that automatically learns how best to map program parallelism to modern parallel computers with GPUs (graphics processing units), which are widely available in consumer, industrial and scientific computers. Rather than hard-coding a compiler strategy for each parallel computer, we aimed to explore an innovative, portable parallel compiler approach that self-adapts to any heterogeneous hardware and improves its performance over time. This has been achieved by employing machine learning approaches which first learn the optimization space off-line and then automatically derive a strategy that attempts to generate the best mapping for any user program.

For our project we focused on OpenCL, the first open standard for cross-platform parallel computing. Writing programs for heterogeneous systems is a challenging task because the computational resources differ in processing capability, memory availability and communication latency. The research conducted as part of this project included load balancing of OpenCL tasks over heterogeneous compute nodes. The best-performing task partitioning is commonly specific to the application, the input data and the machine. We developed a new solution based on machine learning, for which we demonstrated a performance improvement of approximately 25% over previous approaches on a large variety of test codes. A crucial component of this solution was a prediction model, built on artificial neural networks and principal component analysis, that estimates an efficient task partitioning and thereby guides our compiler-based optimization.
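To illustrate the idea of learning a task partitioning off-line and predicting it for unseen programs, the following is a minimal Python sketch. It is not the Insieme implementation: a toy cost model (`simulated_runtime`) stands in for real OpenCL measurements, a nearest-neighbour lookup stands in for the neural network and principal component analysis, and all names, features and constants are hypothetical.

```python
import math

# Candidate partitionings: fraction of the work items sent to the GPU
# (the remainder runs on the CPU). Hypothetical granularity of 10%.
CANDIDATE_GPU_SHARES = [i / 10 for i in range(11)]


def simulated_runtime(features, gpu_share):
    """Toy cost model standing in for real timing measurements (assumption).

    features = (work_items, bytes_per_item). CPU and GPU run concurrently,
    so the total runtime is that of the slower device.
    """
    work_items, bytes_per_item = features
    cpu_time = (1.0 - gpu_share) * work_items * 4.0       # slow compute, no transfer
    gpu_items = gpu_share * work_items
    gpu_time = gpu_items * (1.0 + 0.5 * bytes_per_item)   # fast compute plus transfer
    if gpu_share > 0:
        gpu_time += 50.0                                  # fixed launch overhead
    return max(cpu_time, gpu_time)


def best_gpu_share(features, runtime=simulated_runtime):
    """Off-line step: try every candidate partitioning and keep the fastest."""
    return min(CANDIDATE_GPU_SHARES, key=lambda share: runtime(features, share))


def train(training_features, runtime=simulated_runtime):
    """Label each training program with its best partitioning."""
    return [(f, best_gpu_share(f, runtime)) for f in training_features]


def predict(model, features):
    """Predict a partitioning for an unseen program.

    Nearest neighbour on log-scaled features; a deliberately simple
    stand-in for the neural-network model described in the text.
    """
    def distance(a, b):
        return math.dist([math.log1p(x) for x in a], [math.log1p(x) for x in b])

    _, share = min(model, key=lambda entry: distance(entry[0], features))
    return share


def split_range(n_items, gpu_share):
    """Turn a predicted share into (cpu_items, gpu_items)."""
    gpu_items = round(n_items * gpu_share)
    return n_items - gpu_items, gpu_items


# Usage with made-up training programs and a made-up unseen program.
training_programs = [(100, 0.0), (100_000, 0.0), (100_000, 8.0)]
model = train(training_programs)
share = predict(model, (80_000, 6.0))
cpu_items, gpu_items = split_range(80_000, share)
print(f"GPU share {share:.1f}: {cpu_items} items on CPU, {gpu_items} on GPU")
```

The exhaustive search in `best_gpu_share` is only affordable off-line; at compile time or run time the model replaces it with a single prediction, which is the point of the approach described above.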
This approach has been implemented as part of the Insieme source-to-source compiler, which is developed and maintained by the University of Innsbruck. Furthermore, for the first time, we extended OpenCL to work on clusters of GPUs and achieved an efficiency of 64% for a variety of codes. The main development goals were to improve performance and to raise productivity by simplifying program development. This project aimed at exploring multi-core system software that is future scalable and has a wide range of applications, from high-performance computing to desktops to embedded mobile devices. It allows portable performance of parallel programs across platforms. Given the increasing prevalence of heterogeneous multi-cores on a large variety of computers, our work and results are a good starting point for using heterogeneous computers effectively, with reduced execution times and increased productivity for parallel programs.
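The report does not say how the 64% efficiency was computed. A common definition, assumed here, is parallel efficiency: the speedup over a single device divided by the number of devices. The sketch below uses made-up timings, not measurements from the project.

```python
def parallel_efficiency(time_one_device, time_n_devices, n_devices):
    """Speedup over one device divided by the number of devices (1.0 is ideal)."""
    if n_devices < 1 or time_one_device <= 0 or time_n_devices <= 0:
        raise ValueError("timings must be positive and n_devices at least 1")
    speedup = time_one_device / time_n_devices
    return speedup / n_devices


# Hypothetical timings: 100 s on one GPU, 19.53125 s on eight GPUs.
efficiency = parallel_efficiency(100.0, 19.53125, 8)
print(f"efficiency: {efficiency:.0%}")
```

With these illustrative numbers the speedup is 5.12 on eight devices, which corresponds to an efficiency of 64%.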
- Universität Innsbruck - 100%
Research Output
- 323 Citations
- 18 Publications
- 2012 | Title: Automatic OpenMP Loop Scheduling: A Combined Compiler and Runtime Approach | DOI: 10.1007/978-3-642-30961-8_7 | Type: Book Chapter | Author: Thoman P | Publisher: Springer Nature | Pages: 88-101
- 2012 | Title: Tuning MPI Runtime Parameter Setting for High Performance Computing | DOI: 10.1109/clusterw.2012.15 | Type: Conference Proceeding Abstract | Author: Pellegrini S | Pages: 213-221
- 2012 | Title: A Multi-Objective Auto-Tuning Framework for Parallel Codes | DOI: 10.1109/sc.2012.7 | Type: Conference Proceeding Abstract | Author: Jordan H | Pages: 1-12
- 2012 | Title: Low-Latency Collectives for the Intel SCC | DOI: 10.1109/cluster.2012.58 | Type: Conference Proceeding Abstract | Author: Kohler A | Pages: 346-354
- 2012 | Title: The JavaSymphony Extensions for Parallel GPU Computing | DOI: 10.1109/icpp.2012.56 | Type: Conference Proceeding Abstract | Author: Aleem M | Pages: 30-39
- 2011 | Title: Visual Data Mining Using the Point Distribution Tensor | Type: Conference Proceeding Abstract | Author: Leimer W et al.
- 2011 | Title: Performance Analysis and Benchmarking of the Intel SCC | DOI: 10.1109/cluster.2011.24 | Type: Conference Proceeding Abstract | Author: Gschwandtner P | Pages: 139-149
- 2014 | Title: Kd-tree based N-Body Simulations with Volume-Mass Heuristic on the GPU | DOI: 10.1109/ipdpsw.2014.141 | Type: Conference Proceeding Abstract | Author: Kofler K | Pages: 1256-1265
- 2014 | Title: Energy Efficient HPC on Embedded SoCs: Optimization Techniques for Mali GPU | DOI: 10.1109/ipdps.2014.24 | Type: Conference Proceeding Abstract | Author: Grasso I | Pages: 123-132
- 2014 | Title: SAMPO: an Agent-based mosquito point model in OpenCL | Type: Conference Proceeding Abstract | Author: Gesing S et al. | Conference: Proceedings of the 2014 Symposium on Agent Directed Simulation
- 2013 | Title: An automatic input-sensitive approach for heterogeneous task partitioning | DOI: 10.1145/2464996.2465007 | Type: Conference Proceeding Abstract | Author: Kofler K | Pages: 149-160
- 2013 | Title: Automatic problem size sensitive task partitioning on heterogeneous parallel systems | DOI: 10.1145/2442516.2442545 | Type: Conference Proceeding Abstract | Author: Grasso I | Pages: 281-282
- 2013 | Title: LibWater | DOI: 10.1145/2464996.2465008 | Type: Conference Proceeding Abstract | Author: Grasso I | Pages: 161-172
- 2012 | Title: OpenMP in a Heterogeneous World, 8th International Workshop on OpenMP, IWOMP 2012, Rome, Italy, June 11-13, 2012. Proceedings | DOI: 10.1007/978-3-642-30961-8 | Type: Book | Publisher: Springer Nature
- 2012 | Title: A multi-objective auto-tuning framework for parallel codes | Type: Conference Proceeding Abstract | Author: Moritsch H et al.
- 2014 | Title: Random Fields Generation on the GPU with the Spectral Turning Bands Method | DOI: 10.1007/978-3-319-09873-9_55 | Type: Book Chapter | Author: Hunger L | Publisher: Springer Nature | Pages: 656-667
- 2014 | Title: A uniform approach for programming distributed heterogeneous computing systems | DOI: 10.1016/j.jpdc.2014.08.002 | Type: Journal Article | Author: Grasso I | Journal: Journal of Parallel and Distributed Computing | Pages: 3228-3239
- 2013 | Title: LibWater: heterogeneous distributed computing made easy | Type: Conference Proceeding Abstract | Author: Fahringer T et al.