Number of hits: 326
Created: Thu, 18 Apr 2024 23:13:43 +0200


Baumstark, Alexander; Jibril, Muhammad Attahir; Sattler, Kai-Uwe
Accelerating large table scan using Processing-In-Memory technology. - In: Datenbanksysteme für Business, Technologie und Web (BTW 2023), (2023), pp. 797-814

Today’s systems are capable of storing large amounts of data in main memory. In-memory DBMSs can benefit particularly from this development. However, the processing of the data from the main memory necessarily has to run via the CPU. This creates a bottleneck, which affects the possible performance of the DBMS. The Processing-In-Memory (PIM) technology is a paradigm to overcome this problem, which was not available in commercial systems for a long time. However, with the availability of UPMEM, a commercial system is finally available that provides PIM technology in hardware. In this work, the main focus was on the optimization of the table scan, a fundamental and memory-bound operation. Here a possible approach is shown, which can be used to optimize this operation by using PIM. This method was then tested for parallelism and execution time in benchmarks with different table sizes and compared to the usual table scan. The result is a table scan that outperforms the scan on the usual CPU significantly.



Schlegel, Marius; Sattler, Kai-Uwe
Management of machine learning lifecycle artifacts: a survey. - In: ACM SIGMOD record, vol. 51 (2023), no. 4, pp. 18-35

The explorative and iterative nature of developing and operating ML applications leads to a variety of artifacts, such as datasets, features, models, hyperparameters, metrics, software, configurations, and logs. In order to enable comparability, reproducibility, and traceability of these artifacts across the ML lifecycle steps and iterations, systems and tools have been developed to support their collection, storage, and management. It is often not obvious what precise functional scope such systems offer so that the comparison and the estimation of synergy effects between candidates are quite challenging. In this paper, we aim to give an overview of systems and platforms which support the management of ML lifecycle artifacts. Based on a systematic literature review, we derive assessment criteria and apply them to a representative selection of more than 60 systems and platforms.



https://doi.org/10.1145/3582302.3582306
Kläbe, Steffen; DeSantis, Bobby; Hagedorn, Stefan; Sattler, Kai-Uwe
Accelerating Python UDFs in vectorized query execution. - [USA?] : CIDR Conference. - 1 online resource (7 pages). Publication created as part of the event: CIDR 2022 : 12th Annual Conference on Innovative Data Systems Research (CIDR ’22), January 9-12, 2022, Chaminade, USA

https://doi.org/10.22032/dbt.59388
Lasch, Robert; Moghaddamfar, Mehdi; May, Norman; Demirsoy, Suleyman S.; Färber, Christian; Sattler, Kai-Uwe
Bandwidth-optimal relational joins on FPGAs. - Konstanz : University of Konstanz. - 1 online resource (pages 1:27-1:39). Online edition: Proceedings of the 25th International Conference on Extending Database Technology, EDBT 2022, Edinburgh, UK, March 29 - April 1, 2022. - OpenProceedings.org 2022, ISBN 978-3-89318-086-8

https://doi.org/10.5441/002/edbt.2022.03
Schlegel, Marius; Sattler, Kai-Uwe
Cornucopia: tool support for selecting machine learning lifecycle artifact management systems. - Setúbal : Scitepress. - 1 online resource (pages 444-450). Online edition: Proceedings of the 18th International Conference on Web Information Systems and Technologies, WEBIST, October 25-27, 2022, in Valletta, Malta / editors: Stefan Decker ... - Setúbal : Scitepress, 2022, ISBN 978-989-758-613-2

The explorative and iterative nature of developing and operating machine learning (ML) applications leads to a variety of ML artifacts, such as datasets, models, hyperparameters, metrics, software, and configurations. To enable comparability, traceability, and reproducibility of ML artifacts across the ML lifecycle steps and iterations, platforms, frameworks, and tools have been developed to support their collection, storage, and management. Selecting the best-suited ML artifact management systems (AMSs) for a particular use case is often challenging and time-consuming due to the plethora of AMSs, their different focus, and imprecise specifications of features and properties. Based on assessment criteria and their application to a representative selection of more than 60 AMSs, this paper introduces an interactive web tool that enables the convenient and time-efficient exploration and comparison of ML AMSs.



https://doi.org/10.5220/0011591700003318
Al-Sayeh, Hani; Jibril, Muhammad Attahir; Bin Saeed, Muhammad Waleed; Sattler, Kai-Uwe
SparkCAD: caching anomalies detector for spark applications. - In: Proceedings of the VLDB Endowment, ISSN 2150-8097, vol. 15 (2022), no. 12, pp. 3694-3697

Developers of Apache Spark applications can accelerate their workloads by caching suitable intermediate results in memory and reusing them rather than recomputing them all over again every time they are needed. However, as scientific workflows are becoming more complex, application developers are becoming more prone to making wrong caching decisions, which we refer to as caching anomalies, that lead to poor performance. We present and give a demonstration of Spark Caching Anomalies Detector (SparkCAD), a developer decision support tool that visualizes the logical plan of Spark applications and detects caching anomalies.



https://doi.org/10.14778/3554821.3554877
Sattler, Kai-Uwe; Härder, Theo
Editorial. - In: Datenbank-Spektrum, ISSN 1610-1995, vol. 22 (2022), no. 1, pp. 1-4

https://doi.org/10.1007/s13222-022-00405-2
Al-Sayeh, Hani; Jibril, Muhammad Attahir; Memishi, Bunjamin; Sattler, Kai-Uwe
Blink: lightweight sample runs for cost optimization of big data applications. - In: New Trends in Database and Information Systems, (2022), pp. 144-154

Distributed in-memory data processing engines accelerate iterative applications by caching datasets in memory rather than recomputing them in each iteration. Selecting a suitable cluster size for caching these datasets plays an essential role in achieving optimal performance. We present Blink, an autonomous sampling-based framework, which predicts sizes of cached datasets and selects optimal cluster size without relying on historical runs. We evaluate Blink on iterative, real-world, machine learning applications. With an average sample runs cost of 4.6% compared to the cost of optimal runs, Blink selects the optimal cluster size, saving up to 47.4% of execution cost compared to average cost.



https://doi.org/10.1007/978-3-031-15743-1_14
Lasch, Robert; Legler, Thomas; May, Norman; Scheirle, Bernhard; Sattler, Kai-Uwe
Cost modelling for optimal data placement in heterogeneous main memory. - In: Proceedings of the VLDB Endowment, ISSN 2150-8097, vol. 15 (2022), no. 11, pp. 2867-2880

The cost of DRAM contributes significantly to the operating costs of in-memory database management systems (IMDBMS). Persistent memory (PMEM) is an alternative type of byte-addressable memory that offers - in addition to persistence - higher capacities than DRAM at a lower price with the disadvantage of increased latencies and reduced bandwidth. This paper evaluates PMEM as a cheaper alternative to DRAM for storing table base data, which can make up a significant fraction of an IMDBMS' total memory footprint. Using a prototype implementation in the SAP HANA IMDBMS, we find that placing all table data in PMEM can reduce query performance in analytical benchmarks by more than a factor of two, while transactional workloads are less affected. To quantify the performance impact of placing individual data structures in PMEM, we propose a cost model based on a lightweight workload characterization. Using this model, we show how to place data Pareto-optimally in the heterogeneous memory. Our evaluation demonstrates the accuracy of the model and shows that it is possible to place more than 75% of table data in PMEM while keeping performance within 10% of the DRAM baseline for two analytical benchmarks.



https://doi.org/10.14778/3551793.3551837
Räth, Timo
Interactive and explorative stream processing. - In: DEBS 2022, (2022), pp. 194-197

Formulating a suitable stream processing pipeline for a particular use case is a complicated process that highly depends on the processed data and usually requires many cycles of refinement. By combining the advantages of visual data exploration with the concept of real-time modifiability of a stream processing pipeline, we want to contribute an interactive approach that simplifies and enhances the process of pipeline engineering. As a proof of concept, a prototype has been developed that delivers promising results in various test use cases and allows modifying the parameters and structure of stream processing pipelines at a development stage in a matter of milliseconds. By utilizing collected data and statistics from this explorative intermediate stage, we will automatically generate optimized runtime code for a standalone execution of the constructed pipeline.



https://doi.org/10.1145/3524860.3543287