PatchIndex: exploiting approximate constraints in distributed databases. - In: Distributed and parallel databases : an international journal. - New York, NY [et al.] : Consultants Bureau, ISSN 1573-7578, Vol. 39 (2021), 3, pp. 833-853
Cloud data warehouse systems lower the barrier to access data analytics. These applications often lack a database administrator and integrate data from various sources, potentially leading to data not satisfying strict constraints. Automatic schema optimization in self-managing databases is difficult in these environments without prior data cleaning steps. In this paper, we focus on constraint discovery as a subtask of schema optimization. Perfect constraints might not exist in these unclean datasets because a small set of values violates them. Therefore, we introduce the concept of a generic PatchIndex structure, which handles exceptions to given constraints and enables database systems to define these approximate constraints. We apply the concept to the environment of distributed databases, providing parallel index creation approaches and optimization techniques for parallel queries using PatchIndexes. Furthermore, we describe heuristics for automatic discovery of PatchIndex candidate columns and demonstrate the performance benefits of PatchIndexes in our evaluation.
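To illustrate the idea, an approximate constraint can be discovered by collecting the few rows that violate it as "patches". The following toy sketch checks approximate uniqueness; the 1% threshold and all names are our own illustrative assumptions, not the paper's actual design:

```python
from collections import Counter

APPROX_THRESHOLD = 0.01  # hypothetical: tolerate up to 1% violating rows

def discover_approximate_unique(column):
    """Report the rows violating uniqueness (the "patches") if they are
    few enough, i.e. if the column is approximately unique."""
    counts = Counter(column)
    # every row whose value occurs more than once violates uniqueness
    exceptions = [i for i, v in enumerate(column) if counts[v] > 1]
    if len(exceptions) <= APPROX_THRESHOLD * len(column):
        return exceptions  # constraint holds except for these rows
    return None  # column is not even approximately unique

# one duplicated value among 1000 unique ones: rows 42 and 1000 are patches
patches = discover_approximate_unique(list(range(1000)) + [42])
```

A query exploiting the constraint can then treat the column as unique after filtering out the patch rows recorded in the index.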
Workload-driven placement of column-store data structures on DRAM and NVM. - In: DAMON '21: proceedings of the 17th International Workshop on Data Management on New Hardware (DaMoN 2021) : Virtual Event, China, 21 June 2021. - New York, NY : Association for Computing Machinery, (2021), 5, 8 pp. in total
Non-volatile memory (NVM) offers lower costs per capacity and higher total capacities than DRAM. However, NVM cannot simply be used as a drop-in replacement for DRAM in database management systems due to its different performance characteristics. We thus investigate the placement of column-store data structures in a hybrid hierarchy of DRAM and NVM, with the goal of placing as much data as possible in NVM without compromising performance. After analyzing how different memory access patterns affect query runtimes when columns are placed in NVM, we propose a heuristic that leverages lightweight access counters to suggest which structures should be placed in DRAM and which in NVM. Our evaluation using TPC-H shows that more than 80% of the data touched by queries can be placed in NVM with almost no slowdown, while naively placing all data in NVM would increase runtime by 53%.
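The counter-based heuristic can be sketched roughly as a greedy placement over per-column access statistics. The tuple layout and the accesses-per-byte benefit metric below are our own simplifications, not the paper's exact heuristic:

```python
def place_columns(columns, dram_budget):
    """Greedy sketch of a counter-based placement heuristic: keep the most
    frequently accessed bytes in DRAM, spill the rest to NVM.
    columns: list of (name, size_bytes, access_count) tuples."""
    # rank by accesses per byte so small hot columns win over large cold ones
    ranked = sorted(columns, key=lambda c: c[2] / c[1], reverse=True)
    dram, nvm, used = [], [], 0
    for name, size, _ in ranked:
        if used + size <= dram_budget:
            dram.append(name)
            used += size
        else:
            nvm.append(name)
    return dram, nvm

cols = [("l_orderkey", 100, 900), ("l_comment", 800, 10), ("l_price", 100, 500)]
dram, nvm = place_columns(cols, dram_budget=250)
# the two small, hot columns fit in DRAM; the large, cold one goes to NVM
```

In the paper's setting the access counts would come from lightweight counters maintained during query execution rather than from a static profile.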
Instant graph query recovery on persistent memory. - In: DAMON '21: proceedings of the 17th International Workshop on Data Management on New Hardware (DaMoN 2021) : Virtual Event, China, 21 June 2021. - New York, NY : Association for Computing Machinery, (2021), 10, 4 pp. in total
Persistent memory (PMem), also known as non-volatile memory (NVM), offers new opportunities not only for the design of data structures and system architectures but also for failure recovery in databases. However, instant recovery can mean not only bringing the system up as fast as possible but also continuing long-running queries which have been interrupted by a system failure. In this work, we discuss how PMem can be utilized to implement query recovery for analytical graph queries. Furthermore, we investigate the trade-off between the overhead of managing the query state in PMem at query runtime and the resulting recovery and restart costs.
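The trade-off between runtime overhead and restart cost can be illustrated with a toy checkpointing loop. A plain dict stands in for the PMem region, and all names are illustrative; the paper's actual query-state layout differs:

```python
pmem = {}  # stand-in for a persistent memory region

def run_aggregation(batches, checkpoint_every=1):
    """Sum all batches, checkpointing operator state so a restart resumes
    from the last completed checkpoint instead of from scratch."""
    state = pmem.get("state", {"next_batch": 0, "total": 0})
    for i in range(state["next_batch"], len(batches)):
        state["total"] += sum(batches[i])
        state["next_batch"] = i + 1
        if (i + 1) % checkpoint_every == 0:
            pmem["state"] = dict(state)  # persist: pay overhead now ...
    return state["total"]                # ... for a cheap restart later
```

Raising `checkpoint_every` lowers the runtime overhead but means more work is repeated after a failure, which is exactly the trade-off the abstract describes.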
When bears get machine support: applying machine learning models to scalable DataFrames with Grizzly. - In: Datenbanksysteme für Business, Technologie und Web (BTW 2021). - Bonn : Gesellschaft für Informatik, (2021), pp. 195-214
The popular Python Pandas framework provides an easy-to-use DataFrame API that enables a broad range of users to analyze their data. However, Pandas faces severe scalability issues in terms of runtime and memory consumption, limiting the usability of the framework. In this paper, we present Grizzly, a replacement for Python Pandas. Instead of bringing the data to the operators as Pandas does, Grizzly ships program complexity to database systems by transpiling the DataFrame API to SQL code. Additionally, Grizzly offers user-friendly support for combining different data sources, user-defined functions, and applying machine learning models directly inside the database system. Our evaluation shows that Grizzly significantly outperforms Pandas as well as state-of-the-art frameworks for distributed Python processing in several use cases.
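The core transpilation idea, recording DataFrame operations lazily and emitting SQL only when a result is needed, can be sketched as follows. Class and method names are illustrative, not Grizzly's actual API:

```python
class LazyFrame:
    """Minimal sketch of DataFrame-to-SQL transpilation: calls only record
    operations; SQL is generated (and would be shipped to the database
    system) when the result is actually needed."""

    def __init__(self, table, where=None, cols="*"):
        self.table, self.where, self.cols = table, where, cols

    def filter(self, predicate):
        # conjoin with any previously recorded predicate
        cond = f"({self.where}) AND ({predicate})" if self.where else predicate
        return LazyFrame(self.table, cond, self.cols)

    def project(self, *cols):
        return LazyFrame(self.table, self.where, ", ".join(cols))

    def to_sql(self):
        sql = f"SELECT {self.cols} FROM {self.table}"
        if self.where:
            sql += f" WHERE {self.where}"
        return sql

q = LazyFrame("orders").filter("price > 100").project("id", "price")
# q.to_sql() yields: SELECT id, price FROM orders WHERE price > 100
```

Because each call returns a new immutable frame, the whole pipeline collapses into a single SQL statement executed where the data lives.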
Updatable materialization of approximate constraints. - In: IEEE Xplore digital library. - New York, NY : IEEE, ISSN 2473-2001, (2021), pp. 1991-1996
Modern big data applications integrate data from various sources. As a result, these datasets may not satisfy perfect constraints, leading to sparse schema information and non-optimal query performance. The existing PatchIndex approach enables the definition of approximate constraints and improves query performance by exploiting the materialized constraint information. As real-world data warehouse workloads are often not limited to read-only queries, we enhance the PatchIndex structure towards an update-conscious design in this paper. To this end, we present a sharded bitmap as the underlying data structure, which offers efficient update operations, and describe approaches to maintain approximate constraints under updates, avoiding index recomputations and full table scans. In our evaluation, we show that PatchIndexes provide more lightweight update support than traditional materialization approaches.
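A sharded bitmap can be sketched as a collection of fixed-size words addressed by shard index, so an update touches only the shard it falls into. This is a minimal toy assuming 64-bit shards; the paper's actual structure may differ:

```python
class ShardedBitmap:
    """Toy sharded bitmap: the bit space is split into fixed-size shards,
    so setting or clearing a bit only modifies one shard instead of
    rewriting the whole bitmap."""

    SHARD_BITS = 64

    def __init__(self):
        self.shards = {}  # shard index -> int used as a 64-bit word

    def set(self, pos):
        shard, bit = divmod(pos, self.SHARD_BITS)
        self.shards[shard] = self.shards.get(shard, 0) | (1 << bit)

    def clear(self, pos):
        shard, bit = divmod(pos, self.SHARD_BITS)
        if shard in self.shards:
            self.shards[shard] &= ~(1 << bit)

    def test(self, pos):
        shard, bit = divmod(pos, self.SHARD_BITS)
        return bool(self.shards.get(shard, 0) >> bit & 1)
```

Marking or unmarking a row as a constraint exception after an update then costs a single word operation in one shard, rather than a full index recomputation.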
From data to artificial intelligence - data management as the foundation for successful AI applications. - In: Digitale Welt : the magazine of digital transformation : science meets industry. - [Hannover] : eMedia Gesellschaft für Elektronische Medien mbH, ISSN 2569-1996, Vol. 5 (2021), 3, pp. 75-79
Adaptive query compilation in graph databases. - In: IEEE Xplore digital library. - New York, NY : IEEE, ISSN 2473-2001, (2021), pp. 112-119
Compiling database queries into compact and efficient machine code has proven to be a great technique to improve query performance and to exploit characteristics of modern hardware. Furthermore, compilation frameworks like LLVM provide powerful optimization techniques and support different backends. However, the time for generating machine code becomes an issue for short-running queries or queries which could produce early results quickly. In this work, we present an adaptive approach integrating graph query interpretation and compilation. While query compilation and code generation are running in the background, the query execution starts using the interpreter. As soon as the code generation is finished, the execution switches to the compiled code. Our evaluation shows that autonomously switching execution modes helps to hide compilation times.
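The adaptive switch between interpretation and compiled execution can be sketched with a background compilation thread; the batch-wise execution model and all names are our own simplifications of the approach:

```python
import threading

def execute_adaptively(query_batches, interpret_batch, compile_query):
    """Start executing with the interpreter immediately; switch to the
    compiled function as soon as background code generation finishes."""
    compiled = {}

    def compile_in_background():
        compiled["fn"] = compile_query()  # code generation may take a while

    threading.Thread(target=compile_in_background, daemon=True).start()

    results = []
    for batch in query_batches:
        fn = compiled.get("fn", interpret_batch)  # pick mode per batch
        results.append(fn(batch))
    return results

# both modes compute the same result, so the switch is transparent
out = execute_adaptively([[1, 2], [3, 4]], sum, lambda: sum)
```

Since interpreter and compiled code produce identical results per batch, query output does not depend on when the switch happens; only the runtime does.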
Datenbanksysteme für Business, Technologie und Web (BTW 2021) : 13-17 September 2021 in Dresden, Germany. - Bonn : Gesellschaft für Informatik, 2021. - 1 CD-ROM. - (GI-Edition. Proceedings. - 311) ISBN 978-3-88579-705-0
Hardware acceleration of modern data management. - In: Advances in engineering research and application : proceedings of the International Conference on Engineering Research and Applications, ICERA 2020. - Cham : Springer International Publishing, (2021), p. 3
Over the past thirty years, database management systems have been established as one of the most successful software concepts. In today's business environment, they constitute the centerpiece of almost all critical IT systems. The reasons for this success are manifold. On the one hand, such systems provide abstractions hiding the details of the underlying hardware or operating system layers. On the other hand, database management systems are ACID compliant, which enables them to represent an accurate picture of a real-world scenario and ensures correctness of the managed data. However, the currently used database concepts and systems are not well prepared to support emerging application domains such as eSciences, Industry 4.0, the Internet of Things, or Digital Humanities. Furthermore, the volume, variety, veracity, and velocity of data caused by ubiquitous sensors have to be mastered through massive scalability and online processing while retaining traditional qualities of database systems such as consistency, isolation, and descriptive query languages. At the same time, current and future hardware trends provide new opportunities: many-core CPUs, co-processors like GPUs and FPGAs, novel storage technologies like NVRAM and SSDs, as well as high-speed networks. In this talk, we present our research results on the use of modern hardware architectures for data management. We discuss the design of data structures for persistent memory and the use of accelerators like GPUs and FPGAs for database operations.
Advances in engineering research and application : proceedings of the International Conference on Engineering Research and Applications, ICERA 2020. - Cham : Springer International Publishing, 2021. - 1 online resource (xiv, 886 p.). - (Lecture notes in networks and systems. - Volume 178) ISBN 978-3-030-64719-3
This proceedings book gathers selected contributions from the International Conference on Engineering Research and Applications (ICERA 2020), organized at Thai Nguyen University of Technology on December 1-2, 2020. The conference focused on original research in a broad range of areas, such as Mechanical Engineering, Materials and Mechanics of Materials, Mechatronics and Micromechatronics, Automotive Engineering, Electrical and Electronics Engineering, and Information and Communication Technology. The book thus provides the research community with authoritative reports on developments in the most exciting areas in these fields.