Publications
With the increasing prevalence of machine learning and large language model (LLM) inference, heterogeneous computing has become essential. Modern JVMs are embracing this transition through projects such as TornadoVM and Babylon, which enable acceleration on diverse hardware resources, including GPUs and FPGAs. However, while performance results are promising, developers currently face a significant tooling gap: traditional profilers excel at CPU-bound execution but become a “black box” when execution transitions to accelerators, providing no visibility into device memory management, execution patterns, or cross-device data movement. This gap leaves developers without a unified view of how their Java applications behave across the heterogeneous computing stack.
In this paper, we present TornadoViz, a visual analytics tool that leverages TornadoVM’s specialized bytecode system to provide interactive analysis of heterogeneous execution and object lifecycles in managed runtime systems. Unlike existing tools, TornadoViz bridges the managed-native divide by interpreting the bytecode stream that orchestrates heterogeneous execution, hence connecting high-level application logic with low-level hardware utilization patterns. Our tool enables developers to visualize task dependencies, track memory operations across devices, analyze bytecode distribution patterns, and identify performance bottlenecks through interactive dashboards.
Published at:
MPLR 2025
Parallel programs are prone to data races, which are concurrency bugs that are difficult to track and reproduce. Various attempts have been made to create or incorporate tools that aim to dynamically detect data races in Java, but most rely on external race detectors that: a) miss some of the nuances in the Java Memory Model, b) are too slow and complicated to be used in complex real-world applications, or c) produce a lot of false positive reports. In this paper, we present MaTSa, a tool built within OpenJDK, that aims to dynamically detect data races and offer informative pointers to the origin of the race. We evaluate MaTSa and detect several races in the Renaissance benchmark suite and the Quarkus framework, many of which have been reported and resulted in upstream fixes. We compare MaTSa to Java TSan, the only current state-of-the-art dynamic race detector that works on recent OpenJDK versions. We analyze issues with false positives and false negatives for both tools and explain the design decisions causing them. We found MaTSa to be 15x faster on average, while scaling to large programs not supported by other tools.
Published at:
MPLR 2025
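As a minimal illustration of the class of bug that dynamic race detectors such as MaTSa target (this example is ours, not from the paper), the following self-contained Java program contains a textbook data race: two threads perform unsynchronized read-modify-write updates on a shared field, so increments can be lost.

```java
// Illustrative data race (not from the MaTSa paper): two threads do
// unsynchronized read-modify-write updates on a shared static field.
public class RacyCounter {
    static int counter = 0; // shared, unguarded -> data race

    static int run(int incrementsPerThread) {
        counter = 0;
        Runnable work = () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                counter++; // non-atomic: read, add, write back
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter; // may be < 2 * incrementsPerThread due to lost updates
    }

    public static void main(String[] args) {
        System.out.println("final count = " + run(100_000));
    }
}
```

A dynamic detector would flag the conflicting unsynchronized accesses to `counter`; replacing the field with an `AtomicInteger`, or guarding the increment with a lock, establishes the happens-before ordering that removes the race.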
Dynamic and Static Code Analysis for Java Programs on Heterogeneous Hardware [Conference]
Athanasios Stratikopoulos, Tianyu Zuo, Umut Sarp Harbalioglu, Juan Fumero, Michail Papadimitriou, Orion Papadakis, Maria Nektaria Xekalaki, Christos Kotselidis
The increasing prevalence of heterogeneous computing systems, incorporating accelerators like GPUs, has spurred the development of advanced frameworks to bring high-performance capabilities to managed languages. TornadoVM is a state-of-the-art, open-source framework for accelerating Java programs. It enables Java applications to offload computation onto GPUs and other accelerators, thereby bridging the gap between the high-level abstractions of the Java Virtual Machine (JVM) and the low-level, performance-oriented world of parallel programming models, such as OpenCL and CUDA. However, this bridging comes with inherent tradeoffs. The semantic and operational mismatch between these two worlds (such as managed memory versus explicit memory control, or dynamic JIT compilation versus static kernel generation) forces TornadoVM to limit or exclude support for certain Java features. These limitations can hinder developer productivity and make it difficult to identify and resolve compatibility issues during development.
This paper introduces TornadoInsight, a tool that simplifies development with TornadoVM by detecting incompatible Java constructs through static and dynamic analysis. TornadoInsight is developed as an open-source IntelliJ IDEA plugin that provides immediate, source-linked feedback within the developer’s workflow. We present the architecture of TornadoInsight, detail its inspection mechanisms, and evaluate its effectiveness in improving the development workflow for TornadoVM users. TornadoInsight is publicly available and offers a practical solution for enhancing developer experience and productivity in heterogeneous managed runtime environments.
Published at:
MPLR 2025
Using remote memory for the Java heap enables big data analytics frameworks to process large datasets. However, the Java Virtual Machine (JVM) runtime struggles to maintain low network traffic during garbage collection (GC) and to reclaim space efficiently. To reduce GC cost in big data analytics, systems group long-lived objects into regions and exclude them from frequent GC scans, regardless of whether the heap resides in local or remote memory. Recent work uses a dual-heap design, placing short-lived objects in a local heap and long-lived objects in a remote region-based heap, limiting GC activity to the local heap. However, these systems avoid scanning by reclaiming remote heap space only when regions are fully garbage, an inefficient strategy that delays reclamation and risks out-of-memory (OOM) errors. In this paper, we propose SmartSweep, a system that uses approximate liveness information to balance network traffic and space reclamation in remote heaps. SmartSweep adopts a dual-heap design and avoids scanning or compacting objects in the remote heap. Instead, it estimates the amount of garbage in each region without accessing the remote heap and selectively transfers regions with many garbage objects back to the local heap for reclamation. Preliminary results with Spark and Neo4j show that SmartSweep achieves performance comparable to TeraHeap, which reclaims remote objects lazily, while reducing peak remote memory usage by up to 49% and avoiding OOM errors.
Published at:
MPLR 2025
This paper presents an approach to accelerate Java applications on RISC-V processors equipped with vector extensions. Our approach utilizes a two-stage compilation chain composed of two open-source compilation frameworks. The first compilation is performed by TornadoVM, a Java framework that includes a Just-In-Time (JIT) compiler and a runtime system that translate Java bytecode into OpenCL and SPIR-V. The second compilation is handled by the oneAPI Construction Kit (OCK), a programming framework that translates OpenCL and SPIR-V code into an efficient binary augmented with vector instructions for RISC-V CPUs. We also present a preliminary performance evaluation using matrix multiplication. Results demonstrate a substantial performance improvement in the generated code when compared against functionally equivalent single-threaded and multi-threaded Java implementations, achieving speedups of up to 33x and 4.6x, respectively.
Published at:
RISC-V Summit Europe, 2025
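For reference, the single-threaded Java baseline in an evaluation like this is essentially a plain triple-nested loop; below is a minimal sketch (class and method names are ours, not from the paper). In TornadoVM, the same loop nest would be expressed as a task, with the outer loops annotated with `@Parallel` so the JIT compiler can emit an OpenCL or SPIR-V kernel.

```java
// Plain single-threaded Java matrix multiplication, of the kind used as a
// baseline against accelerated TornadoVM kernels. Names are illustrative.
public class MatMul {
    // Multiplies two n x n matrices stored in row-major flat arrays.
    static float[] multiply(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float sum = 0f;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};
        float[] b = {5, 6, 7, 8};
        float[] c = multiply(a, b, 2);
        System.out.println(c[0] + " " + c[1] + " " + c[2] + " " + c[3]);
    }
}
```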
Network virtualization concepts are extending their influence to other fields, such as Satellite Communication. The WAVE consortium initiative is an example, as it aims to create a new SATCOM ecosystem with interoperable and hardware-accelerated virtual functions. This paper presents SVFF+, a software/hardware framework that enables easy FPGA usage in virtualized environments, introducing partial reconfiguration and Kubernetes (K8S) support. SVFF+ was developed with WAVE consortium applications in mind, but it can be extended to any cloud computing or network virtualization use case, such as NFV. We demonstrate the effectiveness of SVFF+ via benchmarks, showcasing its potential to accelerate network applications and cloud environments. SVFF+ enhances a previous work (SVFF) with support for Kubernetes and partial reconfiguration.
Published at:
The 31st IEEE International Conference on Telecommunications - ICT 2025
Big data analytics frameworks, such as Spark and Giraph, need to process and cache massive datasets that do not always fit on the managed heap. Therefore, frameworks temporarily move long-lived objects outside the heap (off-heap) on a fast storage device. However, this practice results in (1) high serialization/deserialization (S/D) cost and (2) high memory pressure when off-heap objects are moved back for processing.
In this article, we propose TeraHeap, a system that eliminates S/D overhead and expensive GC scans for a large portion of objects in analytics frameworks. TeraHeap relies on three concepts: (1) It eliminates S/D by extending the managed runtime (JVM) to use a second high-capacity heap (H2) over a fast storage device. (2) It offers a simple hint-based interface, allowing analytics frameworks to leverage object knowledge to populate H2. (3) It reduces GC cost by fencing the collector from scanning H2 objects while maintaining the illusion of a single managed heap, ensuring memory safety.
We implement TeraHeap in OpenJDK8 and OpenJDK17 and evaluate it with fifteen widely used applications in two real-world big data frameworks, Spark and Giraph. We find that for the same DRAM size, TeraHeap improves performance by up to 73% and 28% compared to native Spark and Giraph, respectively. Also, it can still provide better performance by consuming up to 4.6× and 1.2× less DRAM than native Spark and Giraph, respectively. TeraHeap can also be used for in-memory frameworks, and applying it to the Neo4j Graph Data Science library improves its performance by up to 26%. Finally, it outperforms Panthera, a state-of-the-art garbage collector for hybrid DRAM-NVM memories, by up to 69%.
Published at:
ACM Transactions on Programming Languages and Systems (TOPLAS), 2024
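The hint-based interface can be pictured roughly as follows. This is a purely hypothetical sketch in the spirit of the paper's H2 hints; the `moveToH2` name, signature, and candidate list below are stand-ins we invented, not TeraHeap's actual JVM interface.

```java
// Purely illustrative sketch of a hint-based interface in the spirit of
// TeraHeap's H2 hints; all names here are hypothetical, not the real API.
import java.util.ArrayList;
import java.util.List;

public class HintSketch {
    // Stand-in for a JVM primitive that marks an object (and what it
    // transitively references) as a candidate for the second heap (H2).
    static final List<Object> h2Candidates = new ArrayList<>();

    static void moveToH2(Object longLived) {
        // A real implementation would tag the object inside the JVM so a
        // later GC cycle migrates it to H2 on the fast storage device.
        h2Candidates.add(longLived);
    }

    public static void main(String[] args) {
        byte[] cachedPartition = new byte[1024]; // e.g., a cached RDD partition
        moveToH2(cachedPartition); // the framework knows this object is long-lived
        System.out.println("hinted objects: " + h2Candidates.size());
    }
}
```

The point of such hints is that the framework, not the collector, knows which objects are long-lived (e.g., cached partitions), so the JVM can place them in H2 without profiling.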
The Intrusion Detection System (IDS) is an effective tool utilized in cybersecurity systems to detect and identify intrusion attacks. With the increasing volume of data generation, the possibility of various forms of intrusion attacks also increases. Feature selection is crucial and often necessary to enhance performance. The structure of the dataset can impact the efficiency of the machine learning model. Furthermore, data imbalance can pose a problem, but sampling approaches can help mitigate it. This research aims to explore machine learning (ML) approaches for IDS, specifically focusing on datasets, machine learning algorithms, and metrics. Three datasets were utilized in this study: KDD 99, UNSW-NB15, and CSE-CIC-IDS 2018. Various machine learning algorithms were chosen and examined to assess IDS performance. The primary objective was to provide a taxonomy for interconnected intrusion detection systems and supervised machine learning algorithms. The selection of datasets is crucial to ensure the suitability of the model construction for IDS usage. The evaluation was conducted for both binary and multi-class classification to ensure the consistency of the selected ML algorithms for the given dataset. The experimental results demonstrated accuracy rates of 100% for binary classification and 99.4% for multi-class classification. In conclusion, it can be stated that supervised machine learning algorithms exhibit high and promising classification performance based on the study of three popular datasets.
Published at:
Applied Sciences Journal, 2023
Large pages have been the de facto mitigation technique to address the translation overheads of virtual memory, with prior work mostly focusing on the large page sizes supported by the x86 architecture, i.e., 2MiB and 1GiB. ARMv8-A and RISC-V support additional intermediate translation sizes, i.e., 64KiB and 32MiB, via OS-assisted TLB coalescing, but their performance potential has largely fallen under the radar due to the limited system software support. In this paper, we propose Elastic Translations (ET), a holistic memory management solution, to fully explore and exploit the aforementioned translation sizes for both native and virtualized execution. ET implements mechanisms that make the OS memory manager coalescing-aware, enabling the transparent and efficient use of intermediate-sized translations. ET also employs policies to guide translation size selection at runtime using lightweight HW-assisted TLB miss sampling. We design and implement ET for ARMv8-A in Linux and KVM. Our real-system evaluation of ET shows that ET improves the performance of memory-intensive workloads by up to 39% in native execution and by 30% on average in virtualized execution.
Published at:
57th IEEE/ACM International Symposium on Microarchitecture (MICRO’24)
With the proliferation of Serverless Computing, the Function-as-a-Service (FaaS) paradigm is nowadays ubiquitous. As a result, the domain has attracted extensive research, both in industry and academia, identifying opportunities and addressing limitations across all aspects of this new Cloud paradigm. Recently, FaaS providers have released production workload traces of their commercial platforms. These expose important characteristics, such as the execution time of function invocations, their number and the distribution of their inter-arrival times, which must be taken into account for a concrete evaluation of innovative solutions. Nevertheless, the Serverless ecosystem still lacks a unified evaluation methodology based on such information. In this paper we attempt to fill this gap, by developing a methodology for fitting existing, real, open-source workloads found in FaaS benchmarking suites to production FaaS workload traces, in a way that sufficiently preserves the aforementioned core statistical properties of such traces. Based on this, we build FaaSRail, an open-source load generator that receives a target maximum request rate and a target total execution duration as inputs from the user and generates representative, scaled down FaaS load.
Published at:
33rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2024
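To preserve inter-arrival-time statistics, a load generator can draw request gaps from a distribution fitted to the trace. As a generic illustration (not FaaSRail's actual fitting algorithm), sampling exponential inter-arrivals via inverse-transform sampling reproduces a Poisson arrival process at a target request rate:

```java
// Generic illustration (not FaaSRail's algorithm): exponential inter-arrival
// gaps yield a Poisson arrival process whose mean rate matches the target
// requests-per-second.
import java.util.Random;

public class ArrivalSketch {
    // Returns n inter-arrival gaps (seconds) for a Poisson process at ratePerSec.
    static double[] interArrivals(double ratePerSec, int n, long seed) {
        Random rng = new Random(seed);
        double[] gaps = new double[n];
        for (int i = 0; i < n; i++) {
            // Inverse-transform sampling of Exp(rate): -ln(U) / rate
            gaps[i] = -Math.log(1.0 - rng.nextDouble()) / ratePerSec;
        }
        return gaps;
    }

    public static void main(String[] args) {
        double[] gaps = interArrivals(100.0, 10_000, 42);
        double mean = 0;
        for (double g : gaps) mean += g;
        mean /= gaps.length;
        System.out.printf("mean gap ~ %.4f s (1/rate = 0.01 s)%n", mean);
    }
}
```

A real generator would replace the exponential with whatever distribution best fits the production trace, and cap the rate at the user-supplied maximum.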
The Standard Portable Intermediate Representation (SPIR-V) is a low-level binary format designed for representing shaders and compute kernels, consumed by OpenCL for compute kernels and by Vulkan for graphics rendering. As a binary representation, SPIR-V is meant to be used by compilers and runtime systems, a task usually performed by C/C++ programs and the LLVM software and compiler ecosystem. However, not all programming environments, runtime systems, and language implementations are written in C/C++ or based on LLVM.
This paper presents the Beehive SPIR-V Toolkit; a framework that can automatically generate a Java composable and functional library for dynamically building SPIR-V binary modules. The Beehive SPIR-V Toolkit can be used by optimizing compilers and runtime systems to generate and validate SPIR-V binary modules from managed runtime systems. Furthermore, our framework is architected to accommodate new SPIR-V releases in an easy-to-maintain manner, and it facilitates the automatic generation of Java libraries for other standards, besides SPIR-V. The Beehive SPIR-V Toolkit also includes an assembler that emits SPIR-V binary modules from disassembled SPIR-V text files, and a disassembler that converts the SPIR-V binary code into a text file. To the best of our knowledge, the Beehive SPIR-V Toolkit is the first Java programming framework that can dynamically generate SPIR-V binary modules.
Published at:
2023 Workshop on Virtual Machines and Language Implementations (VMIL 2023)
Adopting heterogeneous execution on GPUs and FPGAs in managed runtime systems, such as Java, is a challenging task due to the complexities of the underlying virtual machine. The majority of current work has focused on compiler toolchains to solve the challenge of transparent just-in-time compilation of different code segments onto the accelerators. However, apart from providing automatic code generation, another challenge is the seamless interoperability with the host memory manager and the Garbage Collector (GC). Currently, heterogeneous programming models on top of managed runtime systems, such as Aparapi and TornadoVM, need to block the GC when running native code (e.g., JNI code) in order to prevent the GC from moving data while the native code is still running on the hardware accelerator.
To tackle this challenge, this paper proposes a novel Unified Memory (UM) allocator for heterogeneous programming frameworks for managed runtime systems. In this paper, we show how, by making small changes to a Java runtime system, automatic memory management can be enhanced to perform object reclamation not only on the host, but also on the device. This is done by allocating the Java Virtual Machine’s object heap in unified memory, which is visible to all hardware accelerators. In this manner, we enable transparent page migration of Java heap-allocated objects between the host and the accelerator, since our UM system is aware of pointers and object migration due to GC collections. This technique has been implemented in the context of MaxineVM, an open-source research VM for Java written in Java. We evaluated our approach on a discrete and an integrated GPU, showcasing under which conditions UM can benefit execution across different benchmarks and configurations. Our results indicate that when hardware acceleration is not employed, UM does not pose significant overheads unless memory-intensive workloads are encountered, which can exhibit up to 12% (worst case) and 2% (average) slowdowns. In addition, if hardware acceleration is used, UM can achieve up to 9.3x speedup compared to the non-UM baseline implementation.
Published at:
20th International Conference on Managed Programming Languages & Runtimes (MPLR'23)
Java benchmarking suites like DaCapo and Renaissance are employed by the research community to evaluate the performance of novel features in managed runtime systems. These suites encompass various applications with diverse behaviors in order to stress test different subsystems of a managed runtime. Therefore, understanding and characterizing the behavior of these benchmarks is important when trying to interpret experimental results.
This paper presents an in-depth study of the memory behavior of 30 DaCapo and Renaissance applications. To realize the study, a characterization methodology based on a two-faceted profiling process of the Java applications is employed. The two-faceted profiling offers comprehensive insights into the memory behavior of Java applications, as it is composed of high-level and low-level metrics obtained through a Java object profiler (NUMAProfiler) and a microarchitectural event profiler (PerfUtil) of MaxineVM, respectively. By using this profiling methodology, we classify the DaCapo and Renaissance applications regarding their intensity in object allocations, object accesses, LLC pressure, and main memory pressure. In addition, several other aspects, such as the JVM impact on the memory behavior of the application, are discussed.
Published at:
20th International Conference on Managed Programming Languages & Runtimes (MPLR'23)
The address translation (AT) overhead has been widely studied in the literature, and the new 5-level paging is expected to make translation even costlier. Multiple solutions have been proposed to alleviate the issue, either by reducing the number of TLB misses or by reducing their overhead. The solution widely adopted by industry involves extending the page sizes supported by the hardware and software, with the most common being 2MB and 1GB. We evaluate the usefulness of intermediate translation sizes, using memory-intensive workloads running on an ARMv8-A server.
Published at:
18th European Conference on Computer Systems (EuroSys 2023)
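The appeal of larger translation sizes is easy to quantify: TLB reach is simply entries × page size. A back-of-the-envelope sketch for the page sizes discussed above (the 1024-entry TLB is a hypothetical round number, not a measured configuration):

```java
// Back-of-the-envelope TLB reach (entries * page size) for the translation
// sizes discussed; the 1024-entry TLB is a hypothetical round number.
public class TlbReach {
    static long reachBytes(int entries, long pageBytes) {
        return (long) entries * pageBytes;
    }

    public static void main(String[] args) {
        int entries = 1024;
        long KiB = 1024, MiB = 1024 * KiB, GiB = 1024 * MiB;
        long[] pageSizes = {4 * KiB, 64 * KiB, 2 * MiB, 32 * MiB, 1 * GiB};
        for (long p : pageSizes) {
            System.out.printf("page %8d KiB -> reach %8d MiB%n",
                    p / KiB, reachBytes(entries, p) / MiB);
        }
    }
}
```

The intermediate 64KiB and 32MiB sizes sit between the familiar 4KiB/2MiB/1GiB points, trading smaller allocation granularity against reach, which is exactly the design space the paper explores.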