Performance Evaluation of NAS Parallel and High-Performance Conjugate Gradient Benchmarks in Mahameru

Authors

  • Taufiq Wirahman, Research Center for Computing, National Research and Innovation Agency, Bogor, Indonesia
  • Arnida L Latifah, Research Center for Computing, National Research and Innovation Agency, Bogor, and Informatics Study Program, School of Computing, Telkom University, Bandung, Indonesia
  • Furqon Hensan Muttaqien, Research Center for Computing, National Research and Innovation Agency, Bogor, and Information System Study Program, School of Applied Sciences, Telkom University, Bandung, Indonesia
  • I Wayan Aditya Swardiana, Research Center for Computing, National Research and Innovation Agency, Bogor, Indonesia
  • Andria Arisal, Research Center for Data and Information Sciences, National Research and Innovation Agency, Bandung, Indonesia
  • Syam Budi Iryanto, Research Center for Computing, National Research and Innovation Agency, Bogor, Indonesia
  • Rifki Sadikin, Research Center for Computing, National Research and Innovation Agency, Bogor, Indonesia

DOI:

https://doi.org/10.15575/join.v10i2.1557

Keywords:

Conjugate Gradient Algorithm, High-Performance Computing, MPI vs OpenMP, Supercomputing Performance, Parallel Computing

Abstract

High-Performance Computing (HPC) plays a crucial role in accelerating scientific advancement across numerous fields of research and in effectively implementing various complex scientific applications. Mahameru is one of the largest national HPC systems in Indonesia and has been utilized by many sectors. However, it has not undergone proper benchmarking evaluation, which is vital for identifying issues related to hardware and software configurations and for confirming system reliability. Therefore, this study aims to evaluate the performance, efficiency, and capabilities of Mahameru. We present a benchmarking study on Mahameru using two benchmark suites: the NAS Parallel Benchmarks (NPB) and the High-Performance Conjugate Gradient (HPCG) benchmark. Our results indicate that the NPB achieves lower speedup with the Message Passing Interface (MPI) than with OpenMP, which can be attributed to communication overhead and the nature of the computational tasks. Additionally, the HPCG benchmark demonstrates that Mahameru's performance can compete with the lower tiers of the Top500 supercomputers. When operating at full capacity, Mahameru achieves approximately 2.5% of its theoretical peak performance. While the system generally performs reliably with parallel algorithms, it may not fully leverage hyperthreading with certain algorithms. These benchmark results can serve as a basis for decision-making regarding potential upgrades or changes to the system.
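
For readers interpreting the reported figures, the quantities involved follow their standard definitions. The sketch below is illustrative only: Mahameru's node count, cores per node, and clock rate are not listed on this page, so the symbols are placeholders rather than measured values; only the roughly 2.5% figure comes from the abstract.

    % Speedup and parallel efficiency on p processes or threads
    S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}

    % Theoretical peak and the fraction of it reached by HPCG
    R_{\mathrm{peak}} = N_{\mathrm{nodes}} \times N_{\mathrm{cores/node}} \times f_{\mathrm{clock}} \times \mathrm{FLOPs/cycle}
    \frac{R_{\mathrm{HPCG}}}{R_{\mathrm{peak}}} \approx 0.025 \quad \text{(reported for Mahameru at full capacity)}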
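
The HPCG figure is easier to interpret with the structure of the conjugate gradient iteration in mind. The C sketch below is a minimal, dense, unpreconditioned CG loop and is not the HPCG reference code (which solves a sparse 27-point 3D problem with a multigrid preconditioner and MPI halo exchanges); it only shows the kernel mix of matrix-vector products, dot products, and vector updates whose low arithmetic intensity is the usual reason HPCG reaches only a few percent of theoretical peak.

    /* Minimal conjugate gradient sketch (dense, unpreconditioned).
     * Illustrative only; not the HPCG reference implementation. */
    #include <stdio.h>
    #include <math.h>

    #define N 4

    static void matvec(const double A[N][N], const double x[N], double y[N]) {
        for (int i = 0; i < N; i++) {
            y[i] = 0.0;
            for (int j = 0; j < N; j++)
                y[i] += A[i][j] * x[j];
        }
    }

    static double dot(const double a[N], const double b[N]) {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += a[i] * b[i];
        return s;
    }

    int main(void) {
        /* Small symmetric positive definite system A x = b. */
        double A[N][N] = { {4, 1, 0, 0}, {1, 4, 1, 0}, {0, 1, 4, 1}, {0, 0, 1, 4} };
        double b[N] = {1, 2, 3, 4};
        double x[N] = {0, 0, 0, 0};
        double r[N], p[N], Ap[N];

        matvec(A, x, Ap);
        for (int i = 0; i < N; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
        double rr = dot(r, r);

        for (int k = 0; k < 100 && sqrt(rr) > 1e-12; k++) {
            matvec(A, p, Ap);                 /* dominant cost: matrix-vector product */
            double alpha = rr / dot(p, Ap);   /* step length along search direction */
            for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rr_new = dot(r, r);
            double beta = rr_new / rr;        /* coefficient for the new search direction */
            for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }

        for (int i = 0; i < N; i++) printf("x[%d] = %f\n", i, x[i]);
        return 0;
    }

Compiled with any C compiler (linking the math library), this solves the small 4x4 system. In the real benchmark, each of these kernels is distributed across MPI ranks, which is where the communication overhead mentioned in the abstract enters.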


Published

2025-08-17

Section

Article
