Offline and Online Autotuning of Parallel Applications
Disciplines
Computer Sciences (100%)
Keywords
Autotuning, MPI, Benchmarking, Reproducibility, Performance Models
Many scientific applications, such as weather forecasting or earthquake simulation, need to be executed on large, parallel machines to speed up the computation. These parallel machines consist of hundreds or thousands of compute nodes, where each compute node is similar to a common desktop machine. Such parallel applications are most often built on top of the Message Passing Interface (MPI), a standard for data communication. As a result, the run-time of these applications depends on the efficiency of the underlying MPI implementation, and it is therefore of utmost importance to provide the best possible MPI implementation for a given system. Much research has been devoted to developing scalable, efficient implementations of specific MPI functions. For this reason, MPI libraries offer a large set of algorithms and provide many run-time parameters for the purpose of adapting (tuning) themselves to a given parallel machine. In our project, we tackle the problem of optimizing the run-time parameters of MPI libraries in an automated fashion. The challenge is that current MPI libraries expose several hundred tunable parameters, which results in a tremendously large search space. A brute-force approach of testing every combination of parameters would therefore take far too long and is thus impractical. Statistical methods can help us successively reduce the number of parameters that need to be considered. To select the best possible algorithm for specific use cases, we apply modern machine learning techniques. Overall, we will devise and develop a software prototype that can automatically tune MPI libraries to a given parallel machine.
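At its core, autotuning of this kind is measurement-driven selection: run candidate implementations of an operation on the target machine and keep the fastest. Below is a minimal sketch of that idea, assuming only two MPI_Allreduce variants (the library's native implementation versus a composition of MPI_Reduce and MPI_Bcast) and fixed, illustrative message sizes and repetition counts; a real autotuner searches over many more algorithms and parameters.

```c
/* Minimal sketch: selecting the faster of two Allreduce variants by
 * direct measurement. The variants, sizes, and repetition counts are
 * illustrative only. Compile with: mpicc -O2 select_allreduce.c */
#include <mpi.h>
#include <stdio.h>

#define N 4096
#define REPS 50

/* Variant A: the library's native MPI_Allreduce. */
static double bench_native(double *in, double *out, MPI_Comm comm) {
    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++)
        MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, comm);
    return (MPI_Wtime() - t0) / REPS;
}

/* Variant B: Allreduce composed of Reduce followed by Bcast. */
static double bench_reduce_bcast(double *in, double *out, MPI_Comm comm) {
    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        MPI_Reduce(in, out, N, MPI_DOUBLE, MPI_SUM, 0, comm);
        MPI_Bcast(out, N, MPI_DOUBLE, 0, comm);
    }
    return (MPI_Wtime() - t0) / REPS;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = (double)i;

    double ta = bench_native(in, out, MPI_COMM_WORLD);
    double tb = bench_reduce_bcast(in, out, MPI_COMM_WORLD);

    /* Take the maximum time across ranks as the cost of the collective. */
    double ga, gb;
    MPI_Reduce(&ta, &ga, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&tb, &gb, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("native: %.2f us, reduce+bcast: %.2f us -> pick %s\n",
               ga * 1e6, gb * 1e6, ga <= gb ? "native" : "reduce+bcast");

    MPI_Finalize();
    return 0;
}
```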
Scientific applications running on supercomputers are almost exclusively based on the Message Passing Interface (MPI), which defines a set of functions that allow processes to communicate with each other. One type of communication is collective communication, where a group of processes works together to perform a task. For example, the broadcast operation allows one process to send data to all other processes in the group. The Autotune project focuses on strategies for automatically tuning MPI collective communication operations. Tuning a collective operation means selecting the best algorithm and parameters for that operation. We have developed a tuning prototype that monitors the performance of different algorithms and parameters for collective operations and selects the best one based on the current workload and hardware characteristics. Two major problems had to be solved: (1) How do process arrival patterns impact the performance of collective operations? (2) How does synchronizing processes with a broadcast during benchmarking affect the benchmark results?

Benchmarking using Arrival Patterns: We address the challenge of optimizing MPI collective communication by considering process arrival patterns. Arrival imbalances, which are common in real-world applications, significantly impact the performance of collective algorithms. Through simulations and micro-benchmarking, we demonstrate that rooted collectives like MPI_Reduce handle process skew better than non-rooted ones like MPI_Allreduce. We propose a methodology that enhances algorithm selection by profiling arrival patterns and applying the best-performing algorithm; a profiling sketch is given after this summary. Using the FT application from the NAS Parallel Benchmarks, we show that taking arrival patterns into account improves performance.

Benchmarking using Synchronized Clocks: We propose MPIX_Harmonize, an extension to the MPI standard that synchronizes processes in both space and time, minimizing artificial arrival patterns during benchmarking. This approach achieves a synchronization accuracy of around one microsecond, a significant improvement over MPI_Barrier. By eliminating arrival-pattern artifacts, MPIX_Harmonize enables more reliable benchmarking of MPI collective operations. Our analysis demonstrates its effectiveness in producing accurate and consistent performance measurements, encouraging its adoption in high-performance computing environments; the underlying synchronization principle is sketched below.

Tuning HPC Applications at Runtime: We developed an online tuning strategy for MPI collective operations that dynamically selects algorithms based on performance data gathered during real application runs. This approach eliminates the need for prior offline benchmarking, making it more adaptable to changing workloads and hardware configurations. A key component of this strategy is a global performance model that is iteratively updated at runtime. The model tracks the performance of different algorithms and adjusts their selection probabilities to optimize efficiency: for example, if a particular algorithm consistently performs well under certain conditions, its probability of being selected increases (a minimal sketch of such a selection loop closes this section). To validate the approach, we used miniAMR, a benchmark application for adaptive mesh refinement. Our experiments demonstrated significant performance gains for MPI_Allreduce operations.
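To make the arrival-pattern profiling concrete, the sketch below shows one way to record arrivals at a collective call: each rank timestamps its arrival with MPI_Wtime just before the operation, and the root gathers the timestamps to compute the skew. The artificial imbalance loop and all constants are illustrative, and clock offsets between nodes are ignored for brevity, so the result is only meaningful with reasonably synchronized clocks.

```c
/* Sketch: profiling the arrival pattern at a collective call. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Simulate imbalance: some ranks compute longer than others. */
    volatile double x = 0.0;
    for (long i = 0; i < 1000000L * (rank % 4); i++) x += 1.0;

    double buf = (double)rank, res;
    double arrival = MPI_Wtime();              /* local arrival time */
    MPI_Allreduce(&buf, &res, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Root gathers all arrival timestamps and reports the spread. */
    double *arrivals = NULL;
    if (rank == 0) arrivals = malloc(size * sizeof(double));
    MPI_Gather(&arrival, 1, MPI_DOUBLE, arrivals, 1, MPI_DOUBLE, 0,
               MPI_COMM_WORLD);

    if (rank == 0) {
        double min = arrivals[0], max = arrivals[0];
        for (int i = 1; i < size; i++) {
            if (arrivals[i] < min) min = arrivals[i];
            if (arrivals[i] > max) max = arrivals[i];
        }
        printf("arrival spread: %.2f us\n", (max - min) * 1e6);
        free(arrivals);
    }
    MPI_Finalize();
    return 0;
}
```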
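The next sketch illustrates the principle behind synchronized-clock benchmarking: estimate each rank's clock offset to rank 0 via ping-pong round trips, then have all ranks start the measured operation at the same global instant by busy-waiting. This is only an illustration of the idea under simplifying assumptions (midpoint offset estimation, a fixed 10 ms start window), not the MPIX_Harmonize API itself, whose interface and offset estimation are more refined.

```c
/* Sketch: synchronized start via clock-offset estimation to rank 0. */
#include <mpi.h>
#include <stdio.h>

#define PINGPONGS 100

/* Estimate this rank's offset to rank 0 (root_time ~ local_time + offset),
 * keeping the measurement with the smallest round-trip time. */
static double estimate_offset(int rank, MPI_Comm comm) {
    double offset = 0.0;
    if (rank == 0) {
        int size; MPI_Comm_size(comm, &size);
        for (int p = 1; p < size; p++)
            for (int i = 0; i < PINGPONGS; i++) {
                double dummy, t;
                MPI_Recv(&dummy, 1, MPI_DOUBLE, p, 0, comm, MPI_STATUS_IGNORE);
                t = MPI_Wtime();
                MPI_Send(&t, 1, MPI_DOUBLE, p, 0, comm);
            }
    } else {
        double best_rtt = 1e9;
        for (int i = 0; i < PINGPONGS; i++) {
            double t0 = MPI_Wtime(), troot;
            MPI_Send(&t0, 1, MPI_DOUBLE, 0, 0, comm);
            MPI_Recv(&troot, 1, MPI_DOUBLE, 0, 0, comm, MPI_STATUS_IGNORE);
            double rtt = MPI_Wtime() - t0;
            if (rtt < best_rtt) {   /* keep the least-noisy measurement */
                best_rtt = rtt;
                offset = troot - (t0 + rtt / 2.0);
            }
        }
    }
    return offset;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double offset = estimate_offset(rank, MPI_COMM_WORLD);

    /* Root picks a start time far enough in the future and broadcasts it. */
    double start = 0.0;
    if (rank == 0) start = MPI_Wtime() + 0.01;   /* 10 ms ahead */
    MPI_Bcast(&start, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Busy-wait until the local clock, mapped to root time, reaches start. */
    while (MPI_Wtime() + offset < start) ;       /* spin */

    double buf = 1.0, res, t0 = MPI_Wtime();
    MPI_Allreduce(&buf, &res, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: allreduce took %.2f us\n", rank,
           (MPI_Wtime() - t0) * 1e6);

    MPI_Finalize();
    return 0;
}
```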
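Finally, a minimal sketch of the online selection loop, using the standard PMPI profiling interface to intercept MPI_Allreduce so that every application call doubles as a measurement. The two variants, the epsilon-greedy rule, and the running-mean model are simplifying assumptions; the project's prototype maintains a more elaborate global performance model and updates selection probabilities incrementally. Since all ranks must execute the same variant to avoid deadlock, rank 0 decides and broadcasts its choice; MPI_IN_PLACE handling is omitted for brevity.

```c
/* Sketch: online variant selection inside a PMPI interception layer.
 * Compile into a library and link it before the MPI library. */
#include <mpi.h>
#include <stdlib.h>

#define NUM_VARIANTS 2
#define EPSILON 0.1                     /* exploration rate (illustrative) */

static double avg_time[NUM_VARIANTS];   /* running mean runtime per variant */
static long   num_obs[NUM_VARIANTS];

static int choose_variant(void) {
    /* Explore until both variants have been observed, then mostly exploit. */
    if ((double)rand() / RAND_MAX < EPSILON ||
        num_obs[0] == 0 || num_obs[1] == 0)
        return rand() % NUM_VARIANTS;             /* explore */
    return avg_time[0] <= avg_time[1] ? 0 : 1;    /* exploit */
}

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int n,
                  MPI_Datatype dt, MPI_Op op, MPI_Comm comm) {
    int rank, v = 0, err;
    PMPI_Comm_rank(comm, &rank);
    if (rank == 0) v = choose_variant();
    /* All ranks must agree on the variant, so rank 0 decides. */
    PMPI_Bcast(&v, 1, MPI_INT, 0, comm);

    double t0 = MPI_Wtime();
    if (v == 0) {                       /* variant 0: native allreduce */
        err = PMPI_Allreduce(sendbuf, recvbuf, n, dt, op, comm);
    } else {                            /* variant 1: reduce + bcast */
        err = PMPI_Reduce(sendbuf, recvbuf, n, dt, op, 0, comm);
        if (err == MPI_SUCCESS)
            err = PMPI_Bcast(recvbuf, n, dt, 0, comm);
    }
    double t = MPI_Wtime() - t0;

    num_obs[v]++;
    avg_time[v] += (t - avg_time[v]) / num_obs[v];  /* incremental mean */
    return err;
}
```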
- Technische Universität Wien - 50%
- Universität Wien - 50%
- Siegfried Benkner, Universität Wien, associated research partner
- Balazs Gerofi, Research Center for Computational Science - Japan
- George Bosilca, University of Tennessee - USA
Research Output
- 29 Citations
- 16 Publications
- 1 Dataset & Model
- 1 Software
- 1 Scientific Award
Publications
- 2025, Journal Article: Vardas I, "Mpisee: Communicator-Centric Profiling of MPI Applications", Concurrency and Computation: Practice and Experience, DOI 10.1002/cpe.70158
- 2024, Conference Proceeding: Laso R, "Exploring Scalability in C++ Parallel STL Implementations", pp. 284-293, DOI 10.1145/3673038.3673065
- 2024, Conference Proceeding: Beni M, "MPI Collective Algorithm Selection in the Presence of Process Arrival Patterns", pp. 108-119, DOI 10.1109/cluster59578.2024.00017
- 2024, Conference Proceeding: Träff J, "Modes, Persistence and Orthogonality: Blowing MPI Up", pp. 404-413, DOI 10.1109/scw63240.2024.00061
- 2024, Book Chapter: Vardas I, "Exploring Mapping Strategies for Co-allocated HPC Applications", Springer Nature, pp. 271-276, DOI 10.1007/978-3-031-48803-0_31
- 2024, Journal Article: Salimi Beni M, "Analysis and prediction of performance variability in large-scale computing systems", The Journal of Supercomputing, pp. 14978-15005, DOI 10.1007/s11227-024-06040-w
- 2024, Conference Proceeding: Vardas I, "Improved Parallel Application Performance and Makespan by Colocation and Topology-aware Process Mapping", pp. 119-124, DOI 10.1109/ccgrid59990.2024.00023
- 2023, Conference Proceeding: Alves J, "A Novel Triangular Space-Filling Curve for Cache-Oblivious In-Place Transposition of Square Matrices", pp. 368-378, DOI 10.1109/ipdps54959.2023.00045
- 2023, Conference Proceeding: Schuchart J, "Synchronizing MPI Processes in Space and Time", pp. 1-11, DOI 10.1145/3615318.3615325
- 2023, Conference Proceeding: Hunold S, "Verifying Performance Guidelines for MPI Collectives at Scale", pp. 1264-1268, DOI 10.1145/3624062.3625532
- 2023, Book Chapter: Hunold S, "A Quantitative Analysis of OpenMP Task Runtime Systems", Springer Nature, pp. 3-18, DOI 10.1007/978-3-031-31180-2_1
- 2022, Conference Proceeding: Hunold S, "OMPICollTune: Autotuning MPI Collectives by Incremental Online Learning", pp. 123-128, DOI 10.1109/pmbs56514.2022.00016
- 2022, Conference Proceeding: Hunold S, "An Overhead Analysis of MPI Profiling and Tracing Tools", pp. 5-13, DOI 10.1145/3526063.3535353
- 2022, Conference Proceeding: Vardas I, "mpisee: MPI Profiling for Communication and Communicator Structure", pp. 520-529, DOI 10.1109/ipdpsw55747.2022.00092
- 2022, Journal Article: Alves J, "Cache-oblivious Hilbert Curve-based Blocking Scheme for Matrix Transposition", ACM Transactions on Mathematical Software, pp. 1-28, DOI 10.1145/3555353
- 2021, Conference Proceeding: Hunold S, "MicroBench Maker: Reproduce, Reuse, Improve", pp. 69-74, DOI 10.1109/pmbs54543.2021.00013
Datasets & Software
- 2022, Dataset: "An Overhead Analysis of MPI Profiling and Tracing Tools", DOI 10.5281/zenodo.6535636 (public access)
- 2024, Artifact: "Exploring Scalability in C++ Parallel STL Implementations - ICPP 2024 Artifact", DOI 10.5281/zenodo.12187770
Scientific Awards
- 2022: Best Short Paper Award (research prize, continental/international level of recognition)