Bandwidth-optimal relational joins on FPGAs. - Konstanz : University of Konstanz. - 1 online resource (pages 1:27-1:39). Online edition: Proceedings of the 25th International Conference on Extending Database Technology, EDBT 2022, Edinburgh, UK, March 29 - April 1, 2022. - OpenProceedings.org 2022, ISBN 978-3-89318-086-8
https://doi.org/10.5441/002/edbt.2022.03
Cornucopia: tool support for selecting machine learning lifecycle artifact management systems. - Setúbal : Scitepress. - 1 online resource (pages 444-450). Online edition: Proceedings of the 18th International Conference on Web Information Systems and Technologies, WEBIST, October 25-27, 2022, in Valletta, Malta / editors: Stefan Decker ... - Setúbal : Scitepress, 2022, ISBN 978-989-758-613-2
The explorative and iterative nature of developing and operating machine learning (ML) applications leads to a variety of ML artifacts, such as datasets, models, hyperparameters, metrics, software, and configurations. To enable comparability, traceability, and reproducibility of ML artifacts across the steps and iterations of the ML lifecycle, platforms, frameworks, and tools have been developed to support their collection, storage, and management. Selecting the best-suited ML artifact management system (AMS) for a particular use case is often challenging and time-consuming due to the plethora of AMSs, their differing focuses, and imprecise specifications of their features and properties. Based on assessment criteria and their application to a representative selection of more than 60 AMSs, this paper introduces an interactive web tool that enables the convenient and time-efficient exploration and comparison of ML AMSs.
https://dx.doi.org/10.5220/0011591700003318
SparkCAD: caching anomalies detector for Spark applications. - In: Proceedings of the VLDB Endowment, vol. 15 (2022), no. 12, pp. 3694-3697
Developers of Apache Spark applications can accelerate their workloads by caching suitable intermediate results in memory and reusing them rather than recomputing them every time they are needed. However, as scientific workflows become more complex, application developers are increasingly prone to making wrong caching decisions, which we refer to as caching anomalies and which lead to poor performance. We present and demonstrate the Spark Caching Anomalies Detector (SparkCAD), a developer decision-support tool that visualizes the logical plan of Spark applications and detects caching anomalies.
https://doi.org/10.14778/3554821.3554877
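SparkCAD itself is a visualization and decision-support tool; the PySpark fragment below is only a hypothetical illustration of the kind of anomaly it targets: an intermediate result consumed by two actions without being cached, so its whole lineage is recomputed for each action. Dataset and variable names are ours.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-anomaly-demo").getOrCreate()

df = spark.range(10_000_000)
expensive = df.selectExpr("id", "id * id AS sq").filter("sq % 7 = 0")

# Anomaly: each action below recomputes `expensive` from scratch.
n_rows = expensive.count()
sq_sum = expensive.agg({"sq": "sum"}).collect()

# Fix: cache the reused intermediate result once.
expensive.cache()
n_rows = expensive.count()                       # materializes the cache
sq_sum = expensive.agg({"sq": "sum"}).collect()  # served from memory

spark.stop()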
Editorial. - In: Datenbank-Spektrum, ISSN 1610-1995, vol. 22 (2022), no. 1, pp. 1-4
https://doi.org/10.1007/s13222-022-00405-2
Blink: lightweight sample runs for cost optimization of big data applications. - In: New Trends in Database and Information Systems (2022), pp. 144-154
Distributed in-memory data processing engines accelerate iterative applications by caching datasets in memory rather than recomputing them in each iteration. Selecting a suitable cluster size for caching these datasets plays an essential role in achieving optimal performance. We present Blink, an autonomous sampling-based framework that predicts the sizes of cached datasets and selects the optimal cluster size without relying on historical runs. We evaluate Blink on iterative, real-world machine learning applications. With sample runs costing on average only 4.6% of the optimal runs, Blink selects the optimal cluster size, saving up to 47.4% of execution cost compared to the average cost.
https://doi.org/10.1007/978-3-031-15743-1_14
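A minimal sketch of the sampling idea, under simplifying assumptions that are ours rather than the paper's: measure cached dataset sizes on a few cheap sample runs, extrapolate linearly to the full input, and pick the smallest cluster whose aggregate memory fits the prediction. All numbers are illustrative.

import numpy as np

# Measured cache sizes (GB) from sample runs on fractions of the input.
sample_fractions = np.array([0.01, 0.02, 0.05])
cached_sizes_gb  = np.array([0.9, 1.8, 4.6])

# Assume cached size grows linearly with the input fraction; extrapolate to 100%.
slope, intercept = np.polyfit(sample_fractions, cached_sizes_gb, 1)
predicted_full_gb = slope * 1.0 + intercept

# Candidate clusters: (number of nodes, usable memory per node in GB).
candidates = [(2, 28), (4, 28), (8, 28), (16, 28)]
chosen = next(((n, mem) for n, mem in candidates
               if n * mem >= predicted_full_gb), None)
print(f"predicted cache size: {predicted_full_gb:.1f} GB -> cluster: {chosen}")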
Cost modelling for optimal data placement in heterogeneous main memory. - In: Proceedings of the VLDB Endowment, vol. 15 (2022), no. 11, pp. 2867-2880
The cost of DRAM contributes significantly to the operating costs of in-memory database management systems (IMDBMS). Persistent memory (PMEM) is an alternative type of byte-addressable memory that offers, in addition to persistence, higher capacities than DRAM at a lower price, with the disadvantage of increased latencies and reduced bandwidth. This paper evaluates PMEM as a cheaper alternative to DRAM for storing table base data, which can make up a significant fraction of an IMDBMS's total memory footprint. Using a prototype implementation in the SAP HANA IMDBMS, we find that placing all table data in PMEM can reduce query performance in analytical benchmarks by more than a factor of two, while transactional workloads are less affected. To quantify the performance impact of placing individual data structures in PMEM, we propose a cost model based on a lightweight workload characterization. Using this model, we show how to place data Pareto-optimally in the heterogeneous memory. Our evaluation demonstrates the accuracy of the model and shows that it is possible to place more than 75% of table data in PMEM while keeping performance within 10% of the DRAM baseline for two analytical benchmarks.
https://doi.org/10.14778/3551793.3551837
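The paper derives its cost model from a lightweight workload characterization; the sketch below only illustrates the placement step under a strong simplifying assumption of ours (per-structure slowdowns that add up): move the structures with the smallest modeled penalty per byte into PMEM first, until a slowdown budget is exhausted. Names and numbers are invented for illustration.

from dataclasses import dataclass

@dataclass
class Column:
    name: str
    size_gb: float
    penalty_pct: float  # modeled slowdown if this column lives in PMEM

columns = [
    Column("orders.base",   40.0, 1.2),
    Column("lineitem.base", 120.0, 3.5),
    Column("dict.hot",       8.0, 9.0),
]

max_slowdown_pct = 10.0  # stay within 10% of the all-DRAM baseline
pmem, total_penalty = [], 0.0
# Greedy: cheapest penalty per byte goes to PMEM first.
for col in sorted(columns, key=lambda c: c.penalty_pct / c.size_gb):
    if total_penalty + col.penalty_pct <= max_slowdown_pct:
        pmem.append(col.name)
        total_penalty += col.penalty_pct

print(f"place in PMEM: {pmem} (modeled slowdown {total_penalty:.1f}%)")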
Interactive and explorative stream processing. - In: DEBS 2022 (2022), pp. 194-197
Formulating a suitable stream processing pipeline for a particular use case is a complicated process that depends heavily on the processed data and usually requires many cycles of refinement. By combining the advantages of visual data exploration with the concept of real-time modifiability of a stream processing pipeline, we contribute an interactive approach that simplifies and enhances the process of pipeline engineering. As a proof of concept, we have developed a prototype that delivers promising results in various test use cases and allows the parameters and structure of stream processing pipelines to be modified at the development stage in a matter of milliseconds. Utilizing the data and statistics collected during this explorative intermediate stage, we then automatically generate optimized runtime code for standalone execution of the constructed pipeline.
https://doi.org/10.1145/3524860.3543287
StreamVizzard - an interactive and explorative stream processing editor. - In: DEBS 2022 (2022), pp. 186-189
Processing continuous data streams is one of the hot topics of our time. A major challenge is the formulation of a suitable and efficient stream processing pipeline. This process is complicated by long restart times after pipeline modifications and tight dependencies on the actual data to be processed. To address these issues, we have developed StreamVizzard, an interactive and explorative stream processing editor that simplifies the pipeline engineering process. Our system allows users to visually configure, execute, and completely modify a pipeline at runtime without any delay. Furthermore, an adaptive visualizer automatically displays each operator's processed data and statistics in a comprehensible way, allowing users to explore the data and supporting their design decisions. Once the pipeline has been finalized, our system automatically optimizes it based on the collected statistics and generates standalone runtime code for productive use on a target stream processing engine.
https://doi.org/10.1145/3524860.3543283
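StreamVizzard is a full visual editor; the toy sketch below merely illustrates one enabling idea behind delay-free pipeline modification, operators that read their parameters from mutable state so an editor can change behavior between events without restarting the stream. All names are ours, not StreamVizzard's API.

import threading, time, random

class FilterOp:
    """A stream operator whose parameters can be changed while it runs."""
    def __init__(self, threshold):
        self.threshold = threshold  # modifiable at runtime
    def __call__(self, value):
        return value if value >= self.threshold else None

op = FilterOp(threshold=0.5)

def editor():
    # Stand-in for the visual editor: tighten the filter mid-stream.
    time.sleep(0.2)
    op.threshold = 0.9

threading.Thread(target=editor).start()

for _ in range(100):        # stand-in for an event stream
    out = op(random.random())
    if out is not None:
        pass                # a downstream operator would consume `out` here
    time.sleep(0.01)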
Adaptive update handling for graph HTAP. - In: 2022 IEEE 38th International Conference on Data Engineering Workshops (2022), pp. 16-23
Processing hybrid transactional/analytical (HTAP) workloads on graph data can benefit significantly from GPU accelerators. However, to exploit the full potential of GPU processing, dedicated graph representations are necessary, which make in-place updates difficult. In this paper, we discuss an approach for adaptively handling updates in a graph database system for HTAP workloads. We discuss and evaluate strategies for propagating updates from an update-friendly table storage to a GPU-optimized sparse matrix format.
https://doi.org/10.1109/ICDEW55742.2022.00007
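A hypothetical sketch of the propagation pattern described above: edge updates accumulate on the transactional side in an update-friendly triple form and are periodically merged into the analytics-friendly compressed sparse row (CSR) matrix, instead of updating CSR in place. scipy stands in here for a GPU sparse library, and all names are illustrative.

import numpy as np
from scipy.sparse import csr_matrix, coo_matrix

n = 5
# Analytical copy: CSR, efficient to scan but hard to update in place.
graph = csr_matrix((n, n), dtype=np.float32)

# Transactional side: new edges buffered as (src, dst, weight) triples.
delta_edges = [(0, 1, 1.0), (1, 3, 1.0), (3, 4, 1.0)]

def propagate(csr, edges, n):
    """Merge buffered updates into a fresh CSR matrix."""
    src, dst, w = map(np.array, zip(*edges))
    delta = coo_matrix((w, (src, dst)), shape=(n, n), dtype=np.float32)
    return csr + delta.tocsr()

graph = propagate(graph, delta_edges, n)
print(graph.toarray())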
JUGGLER: autonomous cost optimization and performance prediction of big data applications. - In: SIGMOD '22 (2022), pp. 1840-1854
Distributed in-memory processing frameworks accelerate iterative workloads by caching suitable datasets in memory rather than recomputing them in each iteration. Selecting appropriate datasets to cache, as well as allocating a suitable cluster configuration for caching them, plays a crucial role in achieving optimal performance. In practice, both are tedious, time-consuming tasks that are often neglected by end users, who are typically unaware of workload semantics, the sizes of intermediate data, and cluster specifications. To address these problems, we present Juggler, an end-to-end framework that autonomously selects appropriate datasets for caching and recommends a correspondingly suitable cluster configuration to end users, with the aim of achieving optimal execution time and cost. We evaluate Juggler on various iterative, real-world machine learning applications. Compared with our baseline, Juggler reduces execution time to 25.1% and cost to 58.1%, on average, as a result of selecting suitable datasets for caching. It recommends the optimal cluster configuration in 50% of cases and a near-optimal configuration in the remaining cases. Moreover, Juggler achieves an average performance prediction accuracy of 90%.
https://doi.org/10.1145/3514221.3517892
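Juggler's actual models are learned end to end; the sketch below only conveys the flavor of the caching decision with a hand-made heuristic of ours: cache a dataset when the recomputation time it saves across reuses is large relative to its memory footprint. All names and numbers are illustrative, not Juggler's model.

from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    compute_s: float   # time to (re)compute once
    reuses: int        # how often iterations reuse it
    size_gb: float

candidates = [
    Dataset("features", compute_s=120.0, reuses=50, size_gb=30.0),
    Dataset("raw_log",  compute_s=15.0,  reuses=1,  size_gb=200.0),
]

memory_budget_gb = 100.0
# Benefit of caching = recomputation time saved on every reuse after the first.
ranked = sorted(candidates,
                key=lambda d: d.compute_s * (d.reuses - 1) / d.size_gb,
                reverse=True)
cached, used = [], 0.0
for d in ranked:
    if d.reuses > 1 and used + d.size_gb <= memory_budget_gb:
        cached.append(d.name)
        used += d.size_gb
print(f"cache: {cached}, memory used: {used} GB")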