Publications

OpenCUBE: Building an Open Source Cloud Blueprint with EPI Systems

Title: OpenCUBE: Building an Open Source Cloud Blueprint with EPI Systems

Authors: Ivy Peng, Martin Schulz, Utz-Uwe Haus, Craig Prunty, Pedro Marcuello, Emanuele Danovaro, Gabin Schieffer, Jacob Wahlgren, Daniel Medeiros, Philipp Friese, Stefano Markidis

Published: Euro-Par ’23: Proceedings of the European Conference on Parallel Processing (Lecture Notes in Computer Science). August 2023.

Abstract: OpenCUBE aims to develop an open-source, full software stack as a blueprint for cloud computing deployed on European Processor Initiative (EPI) hardware, adaptable to emerging workloads across the computing continuum. OpenCUBE prioritizes energy awareness and builds on open APIs, open-source components, advanced SiPearl Rhea processors, and RISC-V accelerators. The project leverages representative workloads, such as cloud-native workloads and workflows for weather forecast data management, molecular docking, and space weather, for evaluation and validation.

Download (arXiv.org)

Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores

Title: Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores

Authors: Gabin Schieffer, Ivy Peng

Published: Euro-Par ’23: Proceedings of the European Conference on Parallel Processing (Lecture Notes in Computer Science). August 2023.

Abstract: In drug discovery, molecular docking aims at characterizing the binding of a drug-like molecule to a macromolecule. AutoDock-GPU, a state-of-the-art docking software, estimates the geometrical conformation of a docked ligand-protein complex by minimizing a scoring function. Our profiling results indicate that the reduction operation used heavily in the scoring function is sub-optimal. Thus, we developed a method to accelerate the sum reduction of four-element vectors using matrix operations on NVIDIA Tensor Cores. We integrated the new reduction operation into AutoDock-GPU and evaluated it on multiple chemical complexes on three GPUs. Our results show that our reduction method is 4–7 times faster than the AutoDock-GPU baseline. We also evaluated the impact of our method on the overall simulation time in real-world docking simulations and achieved a 27% improvement in the average docking time.

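The reformulation at the heart of this paper can be illustrated without CUDA: summing a batch of four-element vectors is the same as multiplying that batch by a vector of ones, which is exactly the fused multiply-accumulate shape (D = A·B + C) that Tensor Cores execute natively. The NumPy sketch below shows only this reformulation; it is not the paper's CUDA kernel, and all names are illustrative.

```python
import numpy as np

# Batch of k four-element vectors (e.g., per-thread partial results).
k = 64
vectors = np.random.rand(k, 4).astype(np.float32)

# Naive reduction: accumulate the vectors element-wise, one add at a time.
naive = np.zeros(4, dtype=np.float32)
for v in vectors:
    naive += v

# Matrix reformulation: the same sum is a (1 x k) @ (k x 4) product.
# Tensor Cores evaluate D = A @ B + C on small tiles in one instruction,
# so expressing the reduction this way lets the hardware fuse the adds.
ones = np.ones((1, k), dtype=np.float32)
matrix_sum = (ones @ vectors).ravel()

assert np.allclose(naive, matrix_sum)
```
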
Download (arXiv.org)

A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems

Title: A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems

Authors: Jacob Wahlgren, Gabin Schieffer, Maya Gokhale, Ivy Peng

Published: SC ’23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. November 2023.

Abstract: Memory disaggregation has recently been adopted in data centers to improve resource utilization, motivated by cost and sustainability. Recent studies on large-scale HPC facilities have also highlighted memory underutilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where node-local memory is supplemented by shared memory pools. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system from the top down in three levels, moving from general memory requirements, to multi-tier memory systems, and then to memory pooling. We provide a multi-level profiling tool and LBench to facilitate this quantitative approach, and evaluate a set of representative HPC workloads on an emulated platform. Our results show that prefetching activities can significantly influence memory traffic profiles. Interference in memory pooling has varied impacts on applications, depending on their access ratios to memory tiers and their arithmetic intensities. Finally, in two case studies, we show the benefits of our findings at the application and system levels: a 50% reduction in remote accesses and a 13% speedup in BFS, and reduced performance variation of co-located workloads under interference-aware job scheduling.

Download (ACM Open Access)

Survey of adaptive containerization architectures for HPC

Title: Survey of adaptive containerization architectures for HPC

Authors: Nina Mujkanovic, Juan J. Durillo, Nicolay Hammer, Tiziano Müller

Published: SC-W ’23: Proceedings of the SC ’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. November 2023. Pages 165–176.

Abstract: Containers offer an array of advantages that benefit research reproducibility and portability. As container tools mature, container security improves, and high-performance computing (HPC) and cloud system tools converge, supercomputing centers are increasingly integrating containers into their workflows. Despite this, most research into containers remains focused on cloud environments. We consider an adaptive containerization architecture approach, in which each chosen component is the tool best adapted to the given system and site requirements, with a focus on accelerating the deployment of applications and workflows on HPC systems using containers. To this end, we discuss the HPC-specific requirements regarding container tools and analyze the entire containerization stack, including container engines and registries, in depth. Finally, we consider various orchestrator and HPC workload manager integration scenarios, including Workload Manager (WLM) in Kubernetes, Kubernetes in WLM, and bridged scenarios. We present a proof-of-concept approach to a Kubernetes Agent in a WLM allocation.

Download (arXiv)

Kub: Enabling Elastic HPC Workloads on Containerized Environments

Title: Kub: Enabling Elastic HPC Workloads on Containerized Environments

Authors: Daniel Medeiros, Gabin Schieffer, Jacob Wahlgren, Ivy Peng

Published: IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 2023.

Abstract: The conventional model of resource allocation in HPC systems is static. Thus, a job cannot leverage newly available resources in the system or release underutilized resources during execution. In this paper, we present Kub, a methodology that enables elastic execution of HPC workloads on Kubernetes so that the resources allocated to a job can be dynamically scaled during execution. One main optimization of our method is to maximize the reuse of the originally allocated resources, so that disruption to the running job is minimized. The scaling procedure is coordinated among nodes through remote procedure calls on Kubernetes for deploying workloads in the cloud. We evaluate our approach using one synthetic benchmark and two production-level MPI-based HPC applications, GROMACS and CM1. Our results demonstrate that the benefits of adapting the allocated resources depend on the workload characteristics. In the tested cases, a properly chosen scaling point for increasing resources during execution achieved up to 2x speedup. Also, the overhead of checkpointing and data reshuffling significantly influences the selection of optimal scaling points and requires application-specific knowledge.

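The abstract describes the scaling mechanism only at a high level. As a rough illustration of how a resize can be driven through the Kubernetes API, the sketch below uses the official Python client to patch the replica count of a StatefulSet assumed to host the job's ranks; the object name, namespace, and replica count are hypothetical, and this is not the Kub implementation.

```python
# Illustrative only: resizing a hypothetical StatefulSet that hosts the
# ranks of an elastic MPI job. Requires the `kubernetes` package and a
# valid kubeconfig.
from kubernetes import client, config

def scale_job(name: str, namespace: str, replicas: int) -> None:
    config.load_kube_config()
    apps = client.AppsV1Api()
    # Patch only the replica count; existing pods are kept, so the
    # originally allocated resources are reused where possible.
    apps.patch_namespaced_stateful_set(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Grow the hypothetical "cm1-ranks" job to 8 pods at a chosen scaling point.
scale_job("cm1-ranks", "hpc-jobs", replicas=8)
```
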
Download (arXiv)

A GPU-Accelerated Molecular Docking Workflow with Kubernetes and Apache Airflow

Title: A GPU-Accelerated Molecular Docking Workflow with Kubernetes and Apache Airflow

Authors: Daniel Medeiros, Gabin Schieffer, Jacob Wahlgren, Ivy Peng

Published: ISC High Performance 2023: High Performance Computing (Lecture Notes in Computer Science). Pages 193–206.

Abstract: Complex workflows play a critical role in accelerating scientific discovery. In many scientific domains, efficient workflow management can lead to faster scientific output and broader user groups. Workflows that can leverage resources across the boundary between cloud and HPC are a strong driver for the convergence of HPC and cloud. This study investigates the transition and deployment of a GPU-accelerated molecular docking workflow, originally designed for HPC systems, onto a cloud-native environment with Kubernetes and Apache Airflow. The case study focuses on state-of-the-art molecular docking software for drug discovery. We provide a DAG-based implementation in Apache Airflow and technical details for GPU-accelerated deployment. We evaluated the workflow using the SWEETLEAD bioinformatics dataset and executed it in a cloud environment with heterogeneous computing resources. Our workflow can effectively overlap different stages when mapped onto different computing resources.

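For readers unfamiliar with Airflow, a minimal sketch of what such a DAG could look like is shown below, assuming Airflow 2.x with the cncf.kubernetes provider. The images, commands, and task names are placeholders rather than the paper's actual pipeline; only the overall shape, a CPU preprocessing task feeding a GPU docking task, follows the abstract.

```python
# Minimal sketch of a docking DAG in Apache Airflow; image and task
# names are placeholders, not the paper's pipeline.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

with DAG(
    dag_id="molecular_docking",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # CPU-only preprocessing of ligand/receptor inputs.
    prepare = KubernetesPodOperator(
        task_id="prepare_inputs",
        name="prepare-inputs",
        image="example/docking-prep:latest",
        cmds=["python", "prepare.py"],
    )

    # GPU-accelerated docking stage, requesting one GPU from Kubernetes.
    dock = KubernetesPodOperator(
        task_id="run_docking",
        name="run-docking",
        image="example/autodock-gpu:latest",
        cmds=["autodock_gpu"],
        container_resources=k8s.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"},
        ),
    )

    prepare >> dock  # preprocessing must finish before docking starts
```
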
Download (arXiv)

Exploring the ARM Coherent Mesh Network Topology

Title: Exploring the ARM Coherent Mesh Network Topology

Authors: Philipp A. Friese, Martin Schulz

Published: Architecture of Computing Systems (ARCS 2024). Lecture Notes in Computer Science, volume 14842. Pages 221–235.

Abstract: The continuously rising number of cores per socket puts a growing demand on on-chip interconnects. The topology of these interconnects is largely kept hidden from the user, yet it can be the source of measurable performance differences on large many-core processors due to core placement on the interconnect. This paper investigates the ARM Coherent Mesh Network (CMN) on an Ampere Altra Max processor. We provide novel insights into the interconnect by experimentally deriving key information on the CMN topology, such as the positions of cores and of memory and cache controllers. Based on this insight, we evaluate the performance characteristics of several benchmarks and tune the thread-to-core mapping to improve application performance. Our methodology is directly applicable to all ARM-based processors using the ARM CMN and, in principle, to all mesh-based on-chip networks.

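On Linux, the thread-to-core tuning the authors describe boils down to CPU affinity. The sketch below shows the mechanism in Python with placeholder core IDs; a real mapping would come from the experimentally derived CMN topology, not from this example.

```python
# Minimal sketch of pinning worker threads to chosen cores on Linux.
# The core IDs below are placeholders; a real mapping would come from
# the experimentally derived CMN topology.
import os
import threading

def worker(core_id: int) -> None:
    # Restrict the calling thread to a single core on the mesh.
    os.sched_setaffinity(0, {core_id})
    # ... run the latency- or bandwidth-sensitive kernel here ...

# e.g., co-locate communicating threads on neighboring mesh positions
chosen_cores = [0, 1, 2, 3]
threads = [threading.Thread(target=worker, args=(c,)) for c in chosen_cores]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
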
Download (Springer Open Access)

Understanding Layered Portability from HPC to Cloud in Containerized Environments

Title: Understanding Layered Portability from HPC to Cloud in Containerized Environments

Authors: Daniel Medeiros, Gabin Schieffer, Jacob Wahlgren, Ivy Peng

Published: International Workshop on Converged Computing on Edge, Cloud, and HPC (WOCC’24). ISC Workshops 2024. Springer.

Abstract: Recent developments in lightweight OS-level virtualization, i.e., containers, provide a potential solution for running HPC applications on cloud platforms. In this work, we focus on the impact of the different layers in a containerized environment when migrating HPC containers from a dedicated HPC system to a cloud platform. On three ARM-based platforms, including the latest NVIDIA Grace CPU, we use six representative HPC applications to characterize the impact of container virtualization, host OS and kernel, and rootless versus privileged container execution. Our results indicate less than 4% container overhead in DGEMM, miniMD, and XSBench, but 8%–10% overhead in FFT, HPCG, and Hypre. We also show that changing between the container execution modes results in negligible performance differences in the six applications.

Download (arXiv)

Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations

Title: Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations

Authors: Francieli Boito, Jim Brandt, Valeria Cardellini, Philip Carns, Florina M. Ciorba, Hilary Egan, Ahmed Eleliemy, Ann Gentile, Thomas Gruber, Jeff Hanson, Utz-Uwe Haus, Kevin Huck, Thomas Ilsche, Thomas Jakobsche, Terry Jones, Sven Karlsson, Abdullah Mueen, Michael Ott, Tapasya Patki, Ivy Peng, Krishnan Raghavan, Stephen Simms, Kathleen Shoga, Michael Showerman, Devesh Tiwari, Torsten Wilde, Keiji Yamamoto

Published: 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops).

Abstract: Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows, and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches, which are laborious and error-prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomous computing and self-adaptive systems. By defining and developing such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.

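For context, MAPE-K structures an autonomy loop as Monitor, Analyze, Plan, and Execute stages sharing a Knowledge base. The sketch below is a generic, minimal rendering of such a loop for an HPC telemetry setting; the metric, threshold, and action are invented for illustration and come from no cited framework.

```python
# A deliberately minimal MAPE-K loop; metric names, thresholds, and the
# response action are invented for illustration.
import time
from typing import Optional

knowledge = {"power_cap_watts": 500.0, "history": []}  # shared Knowledge base

def monitor() -> dict:
    # Stand-in for real telemetry collection (e.g., node power sensors).
    return {"node_power_watts": 480.0}

def analyze(sample: dict) -> bool:
    # Record the sample and flag readings close to the power cap.
    knowledge["history"].append(sample)
    return sample["node_power_watts"] > 0.9 * knowledge["power_cap_watts"]

def plan(violation: bool) -> Optional[str]:
    # Choose a response; a real ODA framework might throttle or reschedule.
    return "lower_frequency" if violation else None

def execute(action: Optional[str]) -> None:
    if action is not None:
        print(f"feedback hook would apply: {action}")

if __name__ == "__main__":
    for _ in range(3):  # a real loop would run continuously
        execute(plan(analyze(monitor())))
        time.sleep(1)
```
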
Download (arXiv)