Publications in the field

Below you will find an automated compilation of the publications of the group. For publications of the individual members of staff, please refer to their personal pages.

List of publications

Number of hits: 283
Generated: Tue, 23 Apr 2024 23:02:17 +0200 in 0.0775 sec


Grollmisch, Sascha; Cano, Estefanía
Improving semi-supervised learning for audio classification with FixMatch. - In: Electronics, ISSN 2079-9292, vol. 10 (2021), 15, 1807, 20 pp. in total

Including unlabeled data in the training process of neural networks using Semi-Supervised Learning (SSL) has shown impressive results in the image domain, where state-of-the-art results were obtained with only a fraction of the labeled data. The commonality between recent SSL methods is that they strongly rely on the augmentation of unannotated data. This is largely unexplored for audio data. In this work, SSL using the state-of-the-art FixMatch approach is evaluated on three audio classification tasks, including music, industrial sounds, and acoustic scenes. The performance of FixMatch is compared to Convolutional Neural Networks (CNN) trained from scratch, Transfer Learning, and SSL using the Mean Teacher approach. Additionally, a simple yet effective approach for selecting suitable augmentation methods for FixMatch is introduced. FixMatch with the proposed modifications always outperformed Mean Teacher and the CNNs trained from scratch. For the industrial sounds and music datasets, the CNN baseline performance using the full dataset was reached with less than 5% of the initial training data, demonstrating the potential of recent SSL methods for audio data. Transfer Learning outperformed FixMatch only for the most challenging dataset from acoustic scene classification, showing that there is still room for improvement.



https://doi.org/10.3390/electronics10151807
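
As a rough illustration of the FixMatch idea named in the abstract above, the following sketch computes the unlabeled-data loss term in the generally known FixMatch formulation: a pseudo-label is taken from the prediction on a weakly augmented input and is only used if its confidence exceeds a threshold. This is not the authors' implementation. The function names, the threshold value of 0.95, and the toy logits are assumptions chosen for illustration, and the audio augmentations themselves are not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fixmatch_unlabeled_loss(weak_logits, strong_logits, threshold=0.95):
    """Unlabeled loss term in the style of FixMatch (illustrative sketch).

    weak_logits / strong_logits: per-sample model outputs for the weakly and
    strongly augmented versions of the same unlabeled inputs (hypothetical).
    """
    if not weak_logits:
        return 0.0
    total = 0.0
    for weak, strong in zip(weak_logits, strong_logits):
        q = softmax(weak)
        confidence = max(q)
        if confidence < threshold:
            continue  # low-confidence samples contribute nothing
        pseudo_label = q.index(confidence)
        p = softmax(strong)
        total += -math.log(p[pseudo_label])  # cross-entropy against the pseudo-label
    # Average over the whole batch, including the masked-out samples.
    return total / len(weak_logits)


# Toy batch: the first sample is confident, the second is masked out.
loss = fixmatch_unlabeled_loss(
    weak_logits=[[10.0, 0.0], [0.1, 0.0]],
    strong_logits=[[0.0, 0.0], [5.0, 0.0]],
)
```

In the toy batch only the first sample passes the threshold, so the loss is the cross-entropy of that one sample divided by the batch size of two.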
Arend, Johannes M.; Garí, Sebastià V. Amengual; Schissler, Carl; Klein, Florian; Robinson, Philip W.
Six-degrees-of-freedom parametric spatial audio based on one monaural room impulse response. - In: Journal of the Audio Engineering Society, ISSN 0004-7554, vol. 69 (2021), 7/8, pp. 557-575

Parametric spatial audio rendering is a popular approach for low computing capacity applications, such as augmented reality systems. However, most methods rely on spatial room impulse responses (SRIR) for sound field rendering with 3 degrees of freedom (DoF), i.e., for arbitrary head orientations of the listener, and often require multiple SRIRs for 6-DoF rendering, i.e., when additionally considering listener translations. This paper presents a method for parametric spatial audio rendering with 6 DoF based on one monaural room impulse response (RIR). The scalable and perceptually motivated encoding results in a parametric description of the spatial sound field for any listener's head orientation or position in space. These parameters form the basis for the binaural room impulse response (BRIR) synthesis algorithm presented in this paper. The physical evaluation revealed good performance, with differences to reference measurements at most tested positions in a room below the just-noticeable differences of various acoustic parameters. The paper further describes the implementation of a 6-DoF real-time virtual acoustic environment (VAE) using the synthesized BRIRs. A pilot study assessing the plausibility of the 6-DoF VAE showed that the system can provide a plausible binaural reproduction, but it also revealed challenges of 6-DoF rendering requiring further research.



https://doi.org/10.17743/jaes.2021.0009
Grollmisch, Sascha; Cano, Estefanía; Mora Ángel, Fernando; López Gil, Gustavo
Ensemble size classification in Colombian Andean string music recordings. - In: Perception, representations, image, sound, music, (2021), pp. 60-74

Reliable methods for automatic retrieval of semantic information from large digital music archives can play a critical role in musicological research and musical heritage preservation. With the advancement of machine learning techniques, new possibilities for information retrieval in scenarios where ground-truth data is scarce are now available. This work investigates the problem of ensemble size classification in music recordings. For this purpose, a new dataset of Colombian Andean string music was compiled and annotated by musicological experts. Different neural network architectures, as well as pre-processing steps and data augmentation techniques were systematically evaluated and optimized. The best deep neural network architecture achieved 81.5% file-wise mean class accuracy using only feed forward layers with linear magnitude spectrograms as input representation. This model will serve as a baseline for future research on ensemble size classification.



Werner, Stephan; Klein, Florian; Neidhardt, Annika; Sloma, Ulrike; Schneiderwind, Christian; Brandenburg, Karlheinz
Creation of auditory augmented reality using a position-dynamic binaural synthesis system - technical components, psychoacoustic needs, and perceptual evaluation. - In: Applied Sciences, ISSN 2076-3417, vol. 11 (2021), 3, 1150, pp. 1-20

For spatial audio reproduction in the context of augmented reality, a position-dynamic binaural synthesis system can be used to synthesize the ear signals for a moving listener. The goal is the fusion of the auditory perception of the virtual audio objects with the real listening environment. Such a system has several components, each of which helps to enable a plausible auditory simulation. For each possible position of the listener in the room, a set of binaural room impulse responses (BRIRs) congruent with the expected auditory environment is required to avoid room divergence effects. Adequate and efficient approaches are methods to synthesize new BRIRs using very few measurements of the listening room. The required spatial resolution of the BRIR positions can be estimated by spatial auditory perception thresholds. Retrieving and processing the tracking data of the listener's head pose and position as well as convolving BRIRs with an audio signal need to be done in real time. This contribution presents work done by the authors, including several technical components of such a system in detail. It shows how the individual components are affected by psychoacoustics. Furthermore, the paper discusses the perceptual effect by means of listening tests demonstrating the appropriateness of the approaches.



https://doi.org/10.3390/app11031150
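
The abstract above mentions convolving BRIRs with an audio signal. A minimal offline sketch of that single step, assuming a mono source and one BRIR pair for a fixed listener pose, could look as follows. The function names and toy values are hypothetical. A real-time system of the kind described would typically use block-wise FFT convolution and switch or interpolate BRIRs as the tracked pose changes, none of which is modeled here.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono_signal, brir_left, brir_right):
    """Return (left, right) ear signals for one fixed listener pose."""
    return convolve(mono_signal, brir_left), convolve(mono_signal, brir_right)


# Toy example with very short, made-up impulse responses.
left, right = render_binaural([1.0, 0.0, 2.0], [0.5, 0.25], [1.0])
```

Each ear signal is as long as the source plus the respective impulse response minus one sample.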
Lenzen, Lucien
Konzept zur Einführung von HDR im Broadcast mithilfe präferenzbasierter Kontrastkompression [Concept for introducing HDR in broadcast using preference-based contrast compression]. - Ilmenau : Universitätsbibliothek, 2020. - 1 online resource (xv, 167 leaves)
Technische Universität Ilmenau, doctoral dissertation, 2021

HDR (High Dynamic Range) makes it possible to capture a far greater contrast range of a scene than is the case in HD broadcast. As a result, details can be preserved in both the highlights and the shadows. However, the options for reproduction are very heterogeneous and usually considerably more limited. To let all viewers nevertheless benefit from the increased capture quality, an adaptation - also called contrast compression - becomes necessary. Manual contrast compression techniques are known from film post-production, while automatic methods are used in computer graphics. Owing to the special requirements of broadcast, however, these cannot simply be transferred. A fundamental challenge lies in viewer preference. The aim of this work is therefore to quantify viewer preference with regard to brightness and color perception and then, building on these results, to offer an algorithmic solution for adapting contrast compression for use in broadcast. Objective and subjective studies are intended to show how image quality can be significantly increased in this way. Finally, exemplary workflows and field trials are used to chart a path toward the widespread introduction of HDR.



https://nbn-resolving.org/urn:nbn:de:gbv:ilm1-2021000124
Neidhardt, Annika; Reif, Boris
Minimum BRIR grid resolution for interactive position changes in dynamic binaural synthesis. - In: 148th Audio Engineering Society International Convention 2020, (2020), pp. 660-669

Grollmisch, Sascha; Cano, Estefanía; Kehling, Christian; Taenzer, Michael
Analyzing the potential of pre-trained embeddings for audio classification tasks. - In: 28th European Signal Processing Conference (EUSIPCO 2020), (2020), pp. 790-794

In the context of deep learning, the availability of large amounts of training data can play a critical role in a model's performance. Recently, several models for audio classification have been pre-trained in a supervised or self-supervised fashion on large datasets to learn complex feature representations, so-called embeddings. These embeddings can then be extracted from smaller datasets and used to train subsequent classifiers. In the field of audio event detection (AED), for example, classifiers using these features have achieved high accuracy without the need for additional domain knowledge. This paper evaluates three state-of-the-art embeddings on six audio classification tasks from the fields of music information retrieval and industrial sound analysis. The embeddings are systematically evaluated by analyzing the influence on classification accuracy of classifier architecture, fusion methods for file-wise predictions, amount of training data, and initial training domain of the embeddings. To better understand the impact of the pre-training step, results are also compared with those acquired with models trained from scratch. On average, the OpenL3 embeddings performed best with a linear SVM classifier. For a reduced amount of training examples, OpenL3 outperforms the initial baseline.



https://doi.org/10.23919/Eusipco47968.2020.9287743
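
The abstract above refers to fusion methods for file-wise predictions without naming them. As an illustration only, the sketch below shows two common ways to turn per-frame class probabilities into one label per file: averaging the probabilities and majority voting. Whether the paper uses these exact strategies is an assumption, and the function names and toy probabilities are hypothetical.

```python
from collections import Counter


def fuse_mean(frame_probs):
    """Average class probabilities over all frames, then pick the best class."""
    n_classes = len(frame_probs[0])
    means = [
        sum(frame[c] for frame in frame_probs) / len(frame_probs)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=means.__getitem__)


def fuse_majority(frame_probs):
    """Pick the class that wins the most per-frame decisions."""
    votes = Counter(
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_probs
    )
    return votes.most_common(1)[0][0]


# Toy file with three frames and two classes: one very confident frame for
# class 0 and two weakly confident frames for class 1.
frame_probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
label_mean = fuse_mean(frame_probs)
label_vote = fuse_majority(frame_probs)
```

On this toy input the two strategies disagree: averaging favors class 0 because of the single confident frame, while majority voting favors class 1.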
Johnson, David S.; Grollmisch, Sascha
Techniques improving the robustness of deep learning models for industrial sound analysis. - In: 28th European Signal Processing Conference (EUSIPCO 2020), (2020), pp. 81-85

The field of Industrial Sound Analysis (ISA) aims to automatically identify faults in production machinery or manufactured goods by analyzing audio signals. Publications in this field have shown that the surface condition of metal balls and different types of bulk materials (screws, nuts, etc.) sliding down a tube can be classified with high accuracy using audio signals and deep neural networks. However, these systems suffer from domain shift, or dataset bias, due to minor changes in the recording setup, which may easily happen in real-world production lines. This paper aims at finding methods to increase the robustness of existing detection systems to domain shift, ideally without the need to record new data or retrain the models. Through five experiments, we implement a convolutional neural network (CNN) for two publicly available ISA datasets and evaluate transfer learning, data normalization and data augmentation as approaches to deal with domain shift. Our results show that while supervised methods with additional labeled data are the best approach, an unsupervised method that implements data augmentation with adaptive normalization is able to improve the performance by a large margin without the need to retrain neural networks.



https://doi.org/10.23919/Eusipco47968.2020.9287327
Brandenburg, Karlheinz; Klein, Florian; Neidhardt, Annika; Sloma, Ulrike; Werner, Stephan
Creating auditory illusions with binaural technology. - In: The technology of binaural understanding, (2020), pp. 623-663

It is pointed out that beyond reproducing the physically correct sound pressure at the eardrums, more effects play a significant role in the quality of the auditory illusion. In some cases, these can dominate perception and even overcome physical deviations. Perceptual effects like the room-divergence effect, additional visual influences, personalization, pose and position tracking as well as adaptation processes are discussed. These effects are described individually, and the interconnections between them are highlighted. With the results from experiments performed by the authors, the perceptual effects can be quantified. Furthermore, concepts are proposed to optimize reproduction systems with regard to those effects. One example could be a system that adapts to varying listening situations as well as individual listening habits, experience and preference.



Grollmisch, Sascha; Johnson, David; Liebetrau, Judith
Visualizing neural network decisions for industrial sound analysis. - In: SMSI 2020, (2020), pp. 267-268