Number of hits: 325
Created: Thu, 28 Mar 2024 23:11:59 +0100 in 0.0905 sec


Lasch, Robert; Legler, Thomas; May, Norman; Scheirle, Bernhard; Sattler, Kai-Uwe
Cooperative memory management for table and temporary data. - In: 1st Workshop on Simplicity in Management of Data, (2023), 2, 5 pages in total

The traditional paradigm for managing memory in database management systems (DBMS) treats memory used for caching table data and memory for temporary data as separate entities. This leads to inefficient utilization of the available memory capacity for mixed workloads. With memory being a significant factor in the costs of operating a DBMS, utilizing memory as efficiently as possible is highly desirable. As an alternative to the traditional paradigm, we propose managing the entire available memory in a cooperative manner to achieve better memory utilization and consequently higher cost-effectiveness for DBMSs. Initial experimental evaluation of cooperative memory management using a prototype implementation shows promising results and leads to several interesting further research directions.
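
The cooperative scheme sketched in the abstract can be pictured as a single memory budget in which cached table pages are evicted on demand whenever operators need more temporary memory. The following Python sketch is purely illustrative: class and method names are invented and it is not the authors' prototype.

```python
from collections import OrderedDict

class CooperativeMemoryManager:
    """One memory budget shared by the table-data cache and temporary operator data.

    Instead of reserving a fixed fraction for each consumer, cached table pages
    are evicted on demand when an operator needs more temporary memory.
    """

    def __init__(self, total_budget_bytes):
        self.total = total_budget_bytes
        self.temp_used = 0
        self.cache = OrderedDict()   # page_id -> size, ordered by recency (LRU)
        self.cache_used = 0

    def _free(self):
        return self.total - self.temp_used - self.cache_used

    def cache_page(self, page_id, size):
        """Cache a table page if it fits; caching is best effort only."""
        if size > self._free():
            return False
        self.cache[page_id] = size
        self.cache.move_to_end(page_id)
        self.cache_used += size
        return True

    def allocate_temp(self, size):
        """Reserve temporary memory, evicting least recently used pages if needed."""
        while size > self._free() and self.cache:
            _, victim_size = self.cache.popitem(last=False)   # evict LRU page
            self.cache_used -= victim_size
        if size > self._free():
            raise MemoryError("budget exhausted even after evicting all cached pages")
        self.temp_used += size

    def release_temp(self, size):
        self.temp_used -= size   # freed capacity becomes available to the cache again
```

An operator building a hash table would call allocate_temp before the build and release_temp afterwards; the table-data cache simply shrinks and regrows around such requests.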



https://doi.org/10.1145/3596225.3596230
Schlegel, Marius; Sattler, Kai-Uwe
MLflow2PROV: extracting provenance from machine learning experiments. - In: Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning (DEEM), (2023), 9, 4 pages in total

Supporting iterative and explorative workflows for developing machine learning (ML) models, ML experiment management systems (ML EMSs), such as MLflow, are increasingly used to simplify the structured collection and management of ML artifacts, such as ML models, metadata, and code. However, EMSs typically suffer from limited provenance capabilities. As a consequence, it is hard to analyze provenance information and gain knowledge that can be used to improve both ML models and their development workflows. We propose a W3C-PROV-compliant provenance model capturing ML experiment activities that originate from Git and MLflow usage. Moreover, we present the tool MLflow2PROV that extracts provenance graphs according to our model, enabling querying, analyzing, and further processing of collected provenance information.
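
The released MLflow2PROV tool implements the provenance model described above; the snippet below only illustrates the general idea of mapping MLflow runs to a W3C PROV graph using the MLflow client API and the prov package. The namespace and the run-to-activity mapping are simplifications made up for this example and do not reflect the tool's actual model.

```python
from mlflow.tracking import MlflowClient
from prov.model import ProvDocument

def runs_to_prov(experiment_id: str) -> ProvDocument:
    """Map MLflow runs to a (very simplified) W3C PROV graph:
    each run becomes an activity, each run's model an entity it generated."""
    client = MlflowClient()
    doc = ProvDocument()
    doc.add_namespace("ex", "http://example.org/mlflow#")   # illustrative namespace

    for run in client.search_runs(experiment_ids=[experiment_id]):
        activity = doc.activity(f"ex:run-{run.info.run_id}")
        activity.add_attributes({"ex:params": str(run.data.params)})
        model = doc.entity(f"ex:model-{run.info.run_id}")
        doc.wasGeneratedBy(model, activity)
    return doc

# Example: doc = runs_to_prov("0"); print(doc.get_provn())
```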



https://doi.org/10.1145/3595360.3595859
Baumstark, Alexander; Jibril, Muhammad Attahir; Sattler, Kai-Uwe
Processing-in-Memory for databases: query processing and data transfer. - In: 19th International Workshop on Data Management on New Hardware, (DaMoN 2023), June 19th 2023, (2023), pp. 107-111

The Processing-in-Memory (PIM) paradigm promises to accelerate data processing by pushing down computation to memory, reducing the amount of data transfer between memory and CPU, and - in this way - relieving the CPU from processing. Particularly in in-memory databases, memory access becomes a performance bottleneck. Thus, PIM seems to offer an interesting solution for database processing. In this work, we investigate how commercially available PIM technology can be leveraged to accelerate query processing by offloading (parts of) query operators to memory. Furthermore, we show how to address the problem of limited PIM storage capacity by interleaving transfer and computation and present a cost model for the data placement problem.
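
The abstract's point about interleaving transfer and computation can be made concrete with a small analytical cost model: with double buffering, the copy of the next chunk overlaps the scan of the current one, so only the slower stage dominates the steady state. The bandwidth numbers and parameter names below are invented placeholders, not measurements or the paper's model.

```python
def scan_cost_seconds(num_bytes, chunk_bytes,
                      transfer_gbps=10.0, pim_scan_gbps=20.0, overlap=True):
    """Rough cost estimate for scanning `num_bytes` on PIM memory that only
    holds `chunk_bytes` at a time. All bandwidth figures are made-up placeholders."""
    chunks = -(-num_bytes // chunk_bytes)              # ceiling division
    t_transfer = chunk_bytes / (transfer_gbps * 1e9)   # host -> PIM copy per chunk
    t_compute = chunk_bytes / (pim_scan_gbps * 1e9)    # scan of one resident chunk
    if not overlap:
        return chunks * (t_transfer + t_compute)
    # With double buffering, copying chunk i+1 overlaps scanning chunk i,
    # so the slower of the two stages determines the per-chunk cost.
    return t_transfer + (chunks - 1) * max(t_transfer, t_compute) + t_compute

# Example: 8 GiB table, 64 MiB resident per round
print(scan_cost_seconds(8 << 30, 64 << 20, overlap=False))
print(scan_cost_seconds(8 << 30, 64 << 20, overlap=True))
```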



https://doi.org/10.1145/3592980.3595323
Baumstark, Alexander; Jibril, Muhammad Attahir; Sattler, Kai-Uwe
Adaptive query compilation with Processing-in-Memory. - In: 2023 IEEE 39th International Conference on Data Engineering Workshops, (2023), pp. 191-197

The challenge for today’s DBMSs is to integrate modern hardware properly in order to provide efficiency and performance. While emerging technologies like Processing-in-Memory (PIM) reduce the bottleneck when accessing memory by offloading computation, DBMSs must adapt to the new characteristics and the provided processing models in order to make use of them efficiently. The Single Program Multiple Data (SPMD) model requires a special precompiled program for the PIM-enabled chips in the UPMEM architecture. Integrating this model into the query processing of a DBMS can improve the overall performance by efficiently exploiting the underlying characteristic of highly parallel execution directly on memory. To address this, we propose an approach to integrate this programming model directly into the query processing by leveraging adaptive query compilation. The experimental results show an improvement in the execution times compared to the execution on non-PIM hardware.
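
One practical question behind this integration is when the adaptively compiled plan should route an operator to the PIM backend at all. The dispatcher below is a hypothetical sketch: the thresholds, operator description, and backend names are invented and only show the shape of such a decision, not the paper's policy.

```python
from dataclasses import dataclass

@dataclass
class OperatorInfo:
    name: str              # e.g. "selection", "aggregation"
    input_bytes: int       # estimated amount of data the operator touches
    has_pim_kernel: bool   # is a precompiled SPMD program available for it?

def choose_backend(op: OperatorInfo,
                   pim_min_bytes: int = 64 << 20,
                   pim_capacity_bytes: int = 8 << 30) -> str:
    """Pick an execution backend for one operator of the compiled pipeline.

    Heuristic sketch: offload only operators that have an SPMD kernel, are large
    enough to amortize the host-to-PIM transfer, and fit the PIM module capacity.
    """
    if not op.has_pim_kernel:
        return "cpu-compiled"
    if op.input_bytes < pim_min_bytes:
        return "cpu-compiled"      # transfer cost would dominate
    if op.input_bytes > pim_capacity_bytes:
        return "cpu-compiled"      # would require chunking; kept simple here
    return "pim-offload"

print(choose_backend(OperatorInfo("selection", 512 << 20, True)))    # -> pim-offload
print(choose_backend(OperatorInfo("aggregation", 1 << 20, True)))    # -> cpu-compiled
```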



https://doi.org/10.1109/ICDEW58674.2023.00035
Kläbe, Steffen
Modern data analytics in the cloud era. - Ilmenau : Universitätsbibliothek, 2023. - 1 online resource (171 pages)
Technische Universität Ilmenau, Dissertation 2023

Cloud computing is the dominant technology of the last decade. The ease of use of the managed environment, combined with a nearly unlimited amount of resources and a pay-per-use pricing model, enables fast and cost-efficient project realization for a broad range of users. Cloud computing also changes the way software is developed, deployed, and used. This thesis focuses on database systems that are deployed in the cloud environment. We identify three main points of interaction between the database engine and its environment that exhibit changed requirements compared to traditional on-premise data warehouse solutions. The first point of interaction is the interaction with elastic resources. Systems in the cloud should support elasticity in order to match load demands while remaining cost-efficient. We present an elastic scaling mechanism for distributed database engines, combined with a partition manager that provides load balancing while minimizing the reassignment of partitions in the event of elastic scaling. In addition, we introduce a strategy for initially populating buffers that makes it possible to exploit scaled resources immediately after scaling. Cloud-based systems are accessible and available from almost anywhere. Data is frequently ingested from numerous endpoints, which differs from ETL pipelines in a conventional data warehouse solution. Many users forgo the definition of strict schema requirements in order to avoid transaction aborts due to conflicts or to speed up data loading. We introduce the concept of PatchIndexes, which enable the definition of approximate constraints. PatchIndexes maintain the exceptions to these constraints, make them usable for query optimization and execution, and provide efficient support for data updates. The concept can be applied to arbitrary constraints, and we give examples for approximate uniqueness and approximate sortedness constraints. Furthermore, we show how PatchIndexes can be used to define advanced constraints such as an approximate multi-key partitioning, which offers robust query performance for workloads with differing partitioning requirements. The third point of interaction is user interaction. Data-driven applications have changed in recent years. Besides traditional SQL queries for business intelligence, data science applications are now also of great importance. In these cases, the database system often acts merely as a data provider, while the computational work takes place in dedicated data science or machine learning environments. We pursue the goal of shifting advanced analytics towards the database engine and present the Grizzly framework as a DataFrame-to-SQL transpiler. On this basis, we identify user-defined functions (UDFs) and machine learning (ML) as important tasks that would benefit from deeper integration into the database engine. We therefore investigate and evaluate approaches for in-database execution of Python UDFs and in-database ML inference.
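
Of the techniques summarized in the abstract, the PatchIndex is the most self-contained: it records the exceptions to an approximate constraint so that queries can treat the bulk of the data as if the constraint held. The sketch below illustrates the idea for an approximately sorted column; the class design and the query-side use are invented for illustration and do not reproduce the thesis' implementation.

```python
import bisect

class PatchIndex:
    """Tracks the positions that violate an assumed sortedness constraint.

    Queries can process the 'clean' part with algorithms that rely on sorted
    input (e.g. binary search) and handle the small exception set separately.
    """

    def __init__(self, column):
        self.column = list(column)
        self.exceptions = set()              # positions violating sortedness
        self._last_clean = float("-inf")
        for pos, value in enumerate(self.column):
            if value < self._last_clean:
                self.exceptions.add(pos)     # patch: handle this row separately
            else:
                self._last_clean = value

    def contains(self, needle):
        """Membership test: binary search on the sorted part, linear probe on patches."""
        clean = [v for i, v in enumerate(self.column) if i not in self.exceptions]
        i = bisect.bisect_left(clean, needle)
        if i < len(clean) and clean[i] == needle:
            return True
        return any(self.column[p] == needle for p in self.exceptions)

    def append(self, value):
        """Appends never reorganize the column; violating rows just become patches."""
        if value < self._last_clean:
            self.exceptions.add(len(self.column))
        else:
            self._last_clean = value
        self.column.append(value)
```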



https://doi.org/10.22032/dbt.57434
Baumstark, Alexander; Jibril, Muhammad Attahir; Sattler, Kai-Uwe
Adaptive query compilation in graph databases. - In: Distributed and Parallel Databases, ISSN 1573-7578, Vol. 41 (2023), 3, pp. 359-386

Compiling database queries into compact and efficient machine code has proven to be a great technique to improve query performance and exploit characteristics of modern hardware. Particularly for graph database queries, which often execute the exact instructions for processing, this technique can lead to an improvement. Furthermore, compilation frameworks like LLVM provide powerful optimization techniques and support different backends. However, the time for generating and optimizing machine code becomes an issue for short-running queries or queries which could produce early results quickly. In this work, we present an adaptive approach integrating graph query interpretation and compilation. While query compilation and code generation are running in the background, the query execution starts using the interpreter. When the code generation is finished, the execution switches to the compiled code. Our evaluation of the approach using short-running and complex queries shows that autonomously switching execution modes helps to improve the runtime of all types of queries and additionally to hide compilation times and the additional latencies of the underlying storage.
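
The switching mechanism described in the abstract, start interpreting immediately and swap in the compiled variant once background code generation finishes, can be sketched independently of any particular graph engine. In the toy version below, compilation is faked with a sleep and the query is a trivial per-morsel function, so only the switching logic is shown; nothing here is the authors' implementation.

```python
import threading
import time

def run_adaptively(morsels, interpret, compile_query):
    """Process `morsels` with `interpret` until the compiled variant is ready,
    then finish the remaining morsels with the compiled function."""
    compiled = {}                              # filled by the background thread

    def compile_in_background():
        compiled["fn"] = compile_query()       # e.g. LLVM codegen in a real engine

    threading.Thread(target=compile_in_background, daemon=True).start()

    results = []
    for morsel in morsels:
        fn = compiled.get("fn", interpret)     # switch as soon as codegen is done
        results.append(fn(morsel))
    return results

# Toy usage: interpretation is slow, 'compilation' takes 50 ms and yields a fast function.
def interpret(m):
    time.sleep(0.01)
    return sum(m)

def compile_query():
    time.sleep(0.05)
    return sum

print(run_adaptively([list(range(100))] * 20, interpret, compile_query))
```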



https://doi.org/10.1007/s10619-023-07430-4
Tröbs, Eric; Hagedorn, Stefan; Sattler, Kai-Uwe
JPTest - grading data science exercises in Jupyter made short, fast and scalable. - In: Datenbanksysteme für Business, Technologie und Web (BTW 2023), (2023), pp. 673-679

Jupyter Notebook is not only a popular tool for publishing data science results, but can also be used for the interactive explanation of teaching content as well as the supervised work on exercises. In order to give students feedback on their solutions, it is necessary to check and evaluate the submitted work. To exploit the possibilities of remote learning as well as to reduce the work needed to evaluate submissions, we present a flexible and efficient framework. It enables automated checking of notebooks for completeness and syntactic correctness as well as fine-grained evaluation of submitted tasks. The framework comes with a high level of parallelization, isolation and a short and efficient API.
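
JPTest's actual API is not reproduced here; the snippet merely shows the building blocks such a grader rests on, executing a submitted notebook with nbformat/nbclient and awarding points for simple checks over the executed cells. The function names and the grading rule are made up for this example.

```python
import nbformat
from nbclient import NotebookClient

def grade_notebook(path, checks, timeout=60):
    """Execute a submitted notebook and apply simple output checks.

    `checks` maps a descriptive name to a predicate over the executed cells;
    each satisfied predicate is worth one point.
    """
    nb = nbformat.read(path, as_version=4)
    NotebookClient(nb, timeout=timeout).execute()   # runs all cells in a fresh kernel
    points = 0
    for name, predicate in checks.items():
        try:
            if predicate(nb.cells):
                points += 1
        except Exception:
            pass                                    # a failing check simply scores 0
    return points

# Example check: some code cell prints "42" on stdout.
def prints_42(cells):
    return any(out.get("text", "").strip() == "42"
               for c in cells if c.cell_type == "code"
               for out in c.get("outputs", []) if out.get("output_type") == "stream")

# grade_notebook("submission.ipynb", {"task 1 prints 42": prints_42})
```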



Jibril, Muhammad Attahir; Baumstark, Alexander; Sattler, Kai-Uwe
Adaptive update handling for graph HTAP. - In: Distributed and Parallel Databases, ISSN 1573-7578, Vol. 41 (2023), 3, pp. 331-357

Hybrid transactional/analytical processing (HTAP) workloads on graph data can significantly benefit from GPU accelerators. However, to exploit the full potential of GPU processing, dedicated graph representations are necessary, which mostly make in-place updates difficult. In this paper, we discuss an adaptive update handling approach in a graph database system for HTAP workloads. We discuss and evaluate strategies for propagating transactional updates from an update-friendly table storage to a GPU-optimized sparse matrix format for analytics.
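
A concrete instance of the propagation problem discussed above is merging a batch of newly committed edges from a row-oriented delta into the CSR (compressed sparse row) arrays commonly used for GPU graph analytics. The NumPy sketch below rebuilds the affected arrays on the host; it illustrates the data movement only and is not one of the paper's strategies.

```python
import numpy as np

def merge_edges_into_csr(indptr, indices, new_edges, num_vertices):
    """Merge a delta of (src, dst) edge insertions into CSR adjacency arrays.

    `indptr`/`indices` are the usual CSR arrays; the result is a freshly built
    pair, mirroring a 'rebuild the analytical copy from the delta' propagation.
    """
    indptr = np.asarray(indptr, dtype=np.int64)
    indices = np.asarray(indices, dtype=np.int64)

    extra = np.zeros(num_vertices, dtype=np.int64)
    for src, _ in new_edges:
        extra[src] += 1

    new_indptr = np.zeros(num_vertices + 1, dtype=np.int64)
    new_indptr[1:] = np.cumsum(np.diff(indptr) + extra)

    new_indices = np.empty(new_indptr[-1], dtype=np.int64)
    cursor = new_indptr[:-1].copy()
    for v in range(num_vertices):              # copy existing neighbours
        old = indices[indptr[v]:indptr[v + 1]]
        new_indices[cursor[v]:cursor[v] + len(old)] = old
        cursor[v] += len(old)
    for src, dst in new_edges:                  # append the delta
        new_indices[cursor[src]] = dst
        cursor[src] += 1
    return new_indptr, new_indices
```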



https://doi.org/10.1007/s10619-023-07428-y
Jibril, Muhammad Attahir; Baumstark, Alexander; Sattler, Kai-Uwe
GTPC: towards a hybrid OLTP-OLAP graph benchmark. - In: Datenbanksysteme für Business, Technologie und Web (BTW 2023), (2023), pp. 105-117

Graph databases are gaining increasing relevance not only for pure analytics but also for full transactional support. Business requirements are evolving to demand analytical insights on fresh transactional data, thereby triggering the emergence of graph systems for hybrid transactional-analytical graph processing (HTAP). In this paper, we present our ongoing work on GTPC, a hybrid graph benchmark targeting such systems, based on the TPC-C and TPC-H benchmarks.



Baumstark, Alexander; Jibril, Muhammad Attahir; Sattler, Kai-Uwe
Accelerating large table scan using Processing-In-Memory technology. - In: Datenbanksysteme für Business, Technologie und Web (BTW 2023), (2023), pp. 797-814

Today’s systems are capable of storing large amounts of data in main memory. In-memory DBMSs can benefit particularly from this development. However, the processing of the data from the main memory necessarily has to run via the CPU. This creates a bottleneck, which affects the possible performance of the DBMS. The Processing-In-Memory (PIM) technology is a paradigm to overcome this problem, which was not available in commercial systems for a long time. However, with the availability of UPMEM, a commercial system is finally available that provides PIM technology in hardware. In this work, the main focus was on the optimization of the table scan, a fundamental and memory-bound operation. Here a possible approach is shown, which can be used to optimize this operation by using PIM. This method was then tested for parallelism and execution time in benchmarks with different table sizes and compared to the usual table scan. The result is a table scan that outperforms the scan on the usual CPU significantly.
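
A real implementation of the offloaded scan targets the UPMEM SDK; the sketch below only mimics its structure in plain Python: the column is split across processing units, every unit runs the same predicate scan on its slice, and the host merges the qualifying row ids. All names and the process-based parallelism are stand-ins for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

def unit_scan(args):
    """Per-unit kernel: evaluate the range predicate on a column slice and
    return global row ids of the qualifying tuples."""
    offset, values, lo, hi = args
    return [offset + i for i, v in enumerate(values) if lo <= v < hi]

def pim_style_scan(column, lo, hi, num_units=16):
    """Host side: partition the column, fan the same kernel out to all units,
    and concatenate the partial result lists (mirrors the offloaded table scan)."""
    step = -(-len(column) // num_units)
    tasks = [(start, column[start:start + step], lo, hi)
             for start in range(0, len(column), step)]
    with ProcessPoolExecutor(max_workers=num_units) as pool:
        parts = pool.map(unit_scan, tasks)
    return [rid for part in parts for rid in part]

if __name__ == "__main__":
    col = [i % 1000 for i in range(1_000_000)]
    print(len(pim_style_scan(col, 10, 20)))   # expect 10,000 matching rows
```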