Publications
Title: OpenCUBE: Building an Open Source Cloud Blueprint with EPI Systems
Authors: Ivy Peng, Martin Schultz, Utz-Uwe Haus, Craig Prunty, Pedro Marcuello, Emanuele Danovaro, Gabin Schieffer, Jacob Wahlgren, Daniel Medeiros, Philipp Friese and Stefano Markidis.
Published: Euro-Par ’23: Proceedings of the European Conference on Parallel Processing (Lecture Notes in Computer Science). August 2023.
Abstract: OpenCUBE aims to develop an open-source, full-stack software blueprint for cloud computing, deployed on European Processor Initiative (EPI) hardware and adaptable to emerging workloads across the computing continuum. OpenCUBE prioritizes energy awareness and builds on open APIs, open-source components, the advanced SiPearl Rhea processor, and RISC-V accelerators. The project leverages representative workloads, such as cloud-native workloads and workflows for weather-forecast data management, molecular docking, and space weather, for evaluation and validation.
Download (arXiv.org)
Title: Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores
Authors: Gabin Schieffer, Ivy Peng
Published: Euro-Par ’23: Proceedings of the European Conference on Parallel Processing (Lecture Notes in Computer Science). August 2023.
Abstract: In drug discovery, molecular docking aims at characterizing the binding of a drug-like molecule to a macromolecule. AutoDock-GPU, a state-of-the-art docking software, estimates the geometrical conformation of a docked ligand-protein complex by minimizing a scoring function. Our profiling results indicate that the reduction operation used heavily in the scoring function is sub-optimal. Thus, we developed a method to accelerate the sum reduction of four-element vectors using matrix operations on NVIDIA Tensor Cores. We integrated the new reduction operation into AutoDock-GPU and evaluated it on multiple chemical complexes on three GPUs. Our results show that our method for the reduction operation is 4–7 times faster than the AutoDock-GPU baseline. We also evaluated the impact of our method on the overall simulation time in real-world docking simulations and achieved a 27% improvement in average docking time.
Download (arXiv.org)
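The core idea behind the Tensor Core reduction described above can be sketched in plain Python (an illustrative sketch only, not the paper's CUDA implementation): the sum of a four-element vector is recast as a product with a ones vector, which is exactly the kind of small matrix operation Tensor Cores execute natively.

```python
# A batch of four-element vectors, as produced by the scoring function.
batch = [[0.0, 1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0, 7.0]]

# Conventional reduction: sum the elements of each vector.
direct = [sum(v) for v in batch]

# Reduction recast as a matrix product with a ones vector:
# an (n x 4) @ (4 x 1) multiply sums the four elements of every
# row in one operation, the pattern Tensor Cores accelerate.
ones = [1.0, 1.0, 1.0, 1.0]
via_matmul = [sum(x * o for x, o in zip(row, ones)) for row in batch]

assert direct == via_matmul  # both are [6.0, 22.0]
```

On a GPU, many such rows are packed into the small fixed-size matrix tiles that the Tensor Core units multiply in a single instruction, amortizing the reduction across the batch.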
Title: A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems
Authors: Jacob Wahlgren, Gabin Schieffer, Maya Gokhale, Ivy Peng
Published: SC ’23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. November 2023.
Abstract: Memory disaggregation has recently been adopted in data centers to improve resource utilization, motivated by cost and sustainability. Recent studies on large-scale HPC facilities have also highlighted memory underutilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where node-local memory is supplemented by shared memory pools. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system from the top down in three levels, moving from general, to multi-tier memory systems, and then to memory pooling. We provide a multi-level profiling tool and LBench to facilitate the quantitative approach. We evaluate a set of representative HPC workloads on an emulated platform. Our results show that prefetching activities can significantly influence memory traffic profiles. Interference in memory pooling has varied impacts on applications, depending on their access ratios to memory tiers and arithmetic intensities. Finally, in two case studies, we show the benefits of our findings at the application and system levels, achieving 50% reduction in remote access and 13% speedup in BFS, and reducing performance variation of co-located workloads in interference-aware job scheduling.
Download (ACM Open Access)
Title: Survey of adaptive containerization architectures for HPC
Authors: Nina Mujkanovic, Juan J. Durillo, Nicolay Hammer, Tiziano Müller
Published: SC-W ’23: Proceedings of the SC ’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. November 2023. Pages 165–176.
Abstract: Containers offer an array of advantages that benefit research reproducibility and portability. As container tools mature, container security improves, and high-performance computing (HPC) and cloud system tools converge, supercomputing centers are increasingly integrating containers into their workflows. Despite this, most research into containers remains focused on cloud environments. We consider an adaptive containerization architecture, in which each component represents the tool best adapted to the given system and site requirements, with a focus on accelerating the deployment of applications and workflows on HPC systems using containers. To this end, we discuss HPC-specific requirements for container tools and analyze the entire containerization stack, including container engines and registries, in depth. Finally, we consider various orchestrator and HPC workload manager integration scenarios, including Workload Manager (WLM) in Kubernetes, Kubernetes in WLM, and bridged scenarios. We present a proof-of-concept approach to a Kubernetes Agent in a WLM allocation.
Download (arXiv)
Title: Kub: Enabling Elastic HPC Workloads on Containerized Environments
Authors: Daniel Medeiros, Gabin Schieffer, Jacob Wahlgren, Ivy Peng
Published: IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). October 2023.
Abstract: The conventional model of resource allocation in HPC systems is static. Thus, a job cannot leverage newly available resources in the system or release underutilized resources during execution. In this paper, we present Kub, a methodology that enables elastic execution of HPC workloads on Kubernetes so that the resources allocated to a job can be dynamically scaled during execution. One main optimization of our method is to maximize the reuse of the originally allocated resources so that disruption to the running job is minimized. The scaling procedure is coordinated among nodes through remote procedure calls on Kubernetes for deploying workloads in the cloud. We evaluate our approach using one synthetic benchmark and two production-level MPI-based HPC applications – GROMACS and CM1. Our results demonstrate that the benefits of adapting the allocated resources depend on the workload characteristics. In the tested cases, a properly chosen scaling point for increasing resources during execution achieved up to 2x speedup. Also, the overhead of checkpointing and data reshuffling significantly influences the selection of optimal scaling points and requires application-specific knowledge.
Download (arXiv)
Title: A GPU-Accelerated Molecular Docking Workflow with Kubernetes and Apache Airflow
Authors: Daniel Medeiros, Gabin Schieffer, Jacob Wahlgren, Ivy Peng
Published: International Conference on High Performance Computing. ISC High Performance 2023: High Performance Computing, pp. 193–206.
Abstract: Complex workflows play a critical role in accelerating scientific discovery. In many scientific domains, efficient workflow management can lead to faster scientific output and broader user groups. Workflows that can leverage resources across the boundary between cloud and HPC are a strong driver for the convergence of HPC and cloud. This study investigates the transition and deployment of a GPU-accelerated molecular docking workflow, originally designed for HPC systems, onto a cloud-native environment with Kubernetes and Apache Airflow. The case study focuses on state-of-the-art molecular docking software for drug discovery. We provide a DAG-based implementation in Apache Airflow and technical details for GPU-accelerated deployment. We evaluated the workflow using the SWEETLEAD bioinformatics dataset and executed it in a cloud environment with heterogeneous computing resources. Our workflow can effectively overlap different stages when mapped onto different computing resources.
Download (arXiv)
Authors: Philipp A. Friese, Martin Schulz
Published: Architecture of Computing Systems (ARCS 2024). Lecture Notes in Computer Science, volume 14842, pp. 221–235.
Abstract: The continuously rising number of cores per socket puts a growing demand on on-chip interconnects. The topology of these interconnects is largely kept hidden from the user, yet for large many-core processors it can be the source of measurable performance differences due to core placement on the interconnect. This paper investigates the ARM Coherent Mesh Network (CMN) on an Ampere Altra Max processor. We provide novel insights into the interconnect by experimentally deriving key information on the CMN topology, such as the position of cores or memory and cache controllers. Based on this insight, we evaluate the performance characteristics of several benchmarks and tune the thread-to-core mapping to improve application performance. Our methodology is directly applicable to all ARM-based processors using the ARM CMN, and in principle applies to all mesh-based on-chip networks.
Download (Springer Open Access)
Title: Understanding Layered Portability from HPC to Cloud in Containerized Environments
Authors: Daniel Medeiros, Gabin Schieffer, Jacob Wahlgren, Ivy Peng
Published: International Workshop on Converged Computing on Edge, Cloud, and HPC (WOCC’24). ISC Workshops 2024. Springer.
Abstract: Recent developments in lightweight OS-level virtualization, i.e., containers, provide a potential solution for running HPC applications on cloud platforms. In this work, we focus on the impact of different layers in a containerized environment when migrating HPC containers from a dedicated HPC system to a cloud platform. On three ARM-based platforms, including the latest Nvidia Grace CPU, we use six representative HPC applications to characterize the impact of container virtualization, host OS and kernel, and rootless and privileged container execution. Our results indicate less than 4% container overhead in DGEMM, miniMD, and XSBench, but 8%-10% overhead in FFT, HPCG, and Hypre. We also show that changing between the container execution modes results in negligible performance differences in the six applications.
Download (arXiv)
Title: Autonomy Loops for Monitoring, Operational Data Analytics, Feedback and Response in HPC Operations
Authors: Francieli Boito, Jim Brandt, Valeria Cardellini, Philip Carns, Florina M. Ciorba, Hilary Egan, Ahmed Eleliemy, Ann Gentile, Thomas Gruber, Jeff Hanson, Utz-Uwe Haus, Kevin Huck, Thomas Ilsche, Thomas Jakobsche, Terry Jones, Sven Karlsson, Abdullah Mueen, Michael Ott, Tapasya Patki, Ivy Peng, Krishnan Raghavan, Stephen Simms, Kathleen Shoga, Michael Showerman, Devesh Tiwari, Torsten Wilde, Keiji Yamamoto
Published: 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops).
Abstract: Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches, which are laborious and error-prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomous computing and self-adaptive systems. By defining and developing such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.
Download (arXiv)
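The MAPE-K formalism referenced in the abstract can be sketched as a minimal control loop (a generic illustration; the telemetry values, thresholds, and function names are hypothetical placeholders, not infrastructure from the paper):

```python
# Minimal MAPE-K loop skeleton: Monitor, Analyze, Plan, Execute over a
# shared Knowledge base. All values and hooks below are hypothetical
# stand-ins for site-specific MODA telemetry and feedback interfaces.

knowledge = {"power_cap_watts": 500, "history": []}

def monitor():
    # A real loop would read system telemetry here (e.g., node power).
    return {"node_power_watts": 520}

def analyze(sample, kb):
    # Record the sample and flag a violation of the power cap.
    kb["history"].append(sample)
    return sample["node_power_watts"] > kb["power_cap_watts"]

def plan(kb):
    # Simple policy: lower the enforced power cap by 10%.
    return {"action": "set_power_cap", "watts": int(kb["power_cap_watts"] * 0.9)}

def execute(action):
    # A real loop would call a vendor-provided feedback hook here.
    print(f"executing {action['action']} -> {action['watts']} W")

sample = monitor()
if analyze(sample, knowledge):
    execute(plan(knowledge))  # prints: executing set_power_cap -> 450 W
```

The interoperability goal in the abstract amounts to standardizing the interfaces between these four stages and the knowledge base, so that the monitor and execute hooks can be swapped per site or vendor.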
Title: Closing the HPC-Cloud Convergence Gap: Multi-Tenant Slingshot RDMA for Kubernetes
Authors: Philipp A. Friese, Ahmed Eleliemy, Utz-Uwe Haus, Martin Schulz
Published: 2025 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–10. IEEE.
Abstract: Converged HPC-Cloud computing is an emerging computing paradigm that aims to support increasingly complex and multi-tenant scientific workflows. These systems must reconcile the isolation requirements of native cloud workloads with the performance demands of HPC applications. In this context, networking hardware is a critical boundary component: it is the conduit for high-throughput, low-latency communication and enables isolation across tenants. HPE Slingshot is a high-speed network interconnect that provides up to 200 Gbps of throughput per port and targets high-performance computing (HPC) systems. The Slingshot host software, including hardware drivers and network middleware libraries, is designed for HPC deployments, which predominantly use single-tenant access modes. Hence, the Slingshot stack is not suited for secure use in multi-tenant deployments, such as converged HPC-Cloud deployments. In this paper, we design and implement an extension to the Slingshot stack targeting converged deployments based on Kubernetes. Our integration provides secure, container-granular, and multi-tenant access to Slingshot RDMA networking capabilities at minimal overhead.
Download (arXiv)
Title: ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments
Authors: Daniel Medeiros, Jeremy J. Williams, Jacob Wahlgren, Leonardo Saud Maia Leite, Ivy Peng
Published: 2025 European Conference on Parallel Processing (Euro-Par), pp. 175–189.
Abstract: Existing state-of-the-art vertical autoscalers for containerized environments are traditionally built for cloud applications, which may behave differently from HPC workloads with their dynamic resource consumption. In these environments, autoscalers may create an inefficient resource allocation. This work analyzes nine representative HPC applications with different memory consumption patterns. Our results identify the limitations and inefficiencies of the Kubernetes Vertical Pod Autoscaler (VPA) for enabling memory-elastic execution of HPC applications. We propose, implement, and evaluate ARC-V, a policy that leverages both in-flight resource updates of pods in Kubernetes and knowledge of the memory consumption patterns of HPC applications to achieve elastic memory provisioning at the node level. Our results show that ARC-V can effectively save memory while eliminating out-of-memory errors compared to the standard Kubernetes VPA.
Download (arXiv)
Title: Application-Focused HPC Network Monitoring
Authors: Philipp A. Friese, Olivier Marsden, Martin Schulz
Published: ISC High Performance 2025: 40th International Conference on High Performance Computing. June 2025.
Abstract: The network hardware used in High-Performance Computing (HPC) systems is one of the core differentiators from regular compute clusters and has seen substantial improvements over the years. However, fully utilizing its capabilities requires careful application tuning, which in turn is only possible if insights into application behavior can be obtained. With the rise of increasingly complex HPC application software stacks, this requires new kinds of fine-grained network monitoring capabilities. While regular TCP/IP-based traffic can be monitored using a large variety of existing tools, for Remote Direct Memory Access (RDMA)-based communication, which is prevalent in HPC networks, only very few tools are available. Further, those few tools either largely depend on features of specific network hardware, are mostly limited to node-granular monitoring, or are limited to specific programming models. In this paper, we introduce a novel network monitoring tool that layers directly on a portable network abstraction library, namely libfabric, while enabling application-specific monitoring. Our tool thus provides network-hardware- and programming-model-agnostic, per-process monitoring of RDMA-based network utilization, and is capable of always-on, system-wide monitoring of production environments. We conduct a detailed overhead analysis on several state-of-the-art HPC systems with a variety of network hardware and show that our monitoring tool induces low overhead in the monitored application. In addition, we apply our monitoring tool to a real-world HPC application running in a production environment.
Title: ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace
Authors: Ruimin Shi, Gabin Schieffer, Maya Gokhale, Pei-Hung Lin, Hiren Patel, Ivy Peng
Published: 2025 European Conference on Parallel Processing (Euro-Par), pp. 33–47.
Abstract: Vector architectures are essential for boosting computing throughput. ARM provides the Scalable Vector Extension (SVE) as a next-generation, length-agnostic vector extension beyond traditional fixed-length SIMD. This work provides a first study of the maturity and readiness of exploiting ARM SVE in HPC. Using selected performance hardware events on the ARM Grace processor and analytical models, we derive new metrics to quantify the effectiveness of SVE vectorization in reducing executed instructions and improving performance. We further propose an adapted roofline model that combines vector length and data elements to identify potential performance bottlenecks. Finally, we propose a decision tree for classifying SVE-boosted performance in applications.
Download (arXiv)
