Low latency time domain multichannel speech and music source separation. - In: Conference record of the Fifty-Fifth Asilomar Conference on Signals, Systems & Computers, (2021), pp. 549-553
The goal is to obtain a simple multichannel source separation with very low latency. Applications include teleconferencing, hearing aids, augmented reality, and selective active noise cancellation. These real-time applications require very low latency, usually less than about 6 ms, and low complexity, because they usually run on small portable devices. For that we do not need the best possible separation, but a "useful" separation, and not just for speech, but also for music and noise. The usual frequency-domain approaches have higher latency and complexity. Hence we introduce a novel probabilistic optimization method, which we call "Random Directions", that can overcome local minima, and apply it to a simple time-domain unmixing structure that is scalable towards low complexity. The method is then compared to frequency-domain approaches on separating speech and music sources, also using 3D microphone setups.
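The core idea of a "Random Directions" optimizer can be sketched as a zeroth-order search: propose a random unit direction, keep the step only if the objective improves, and shrink the step size on rejection. The following is an illustrative reconstruction on a toy quadratic objective, not the authors' exact algorithm; all function and parameter names are hypothetical.

```python
import numpy as np

def random_directions_minimize(f, x0, steps=2000, step=1.0, shrink=0.99, seed=0):
    """Minimal sketch of a zeroth-order 'random directions' search:
    try a random perturbation, keep it only if the objective improves.
    Illustrative reconstruction, not the published algorithm."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(steps):
        d = rng.standard_normal(x.shape)
        d /= np.linalg.norm(d)           # unit-length random direction
        cand = x + step * d
        fc = f(cand)
        if fc < fx:                      # accept only improvements
            x, fx = cand, fc
        else:
            step *= shrink               # on rejection, reduce the step size
    return x, fx

# toy quadratic stand-in for an unmixing objective, minimum at [1, -2, 3]
target = np.array([1.0, -2.0, 3.0])
f = lambda w: np.sum((w - target) ** 2)
w, fw = random_directions_minimize(f, np.zeros(3))
```

Because only improving steps are accepted, the search can escape shallow plateaus that stall fixed-direction coordinate searches, while staying cheap enough for online use.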
End-to-end learning for musical instruments classification. - In: Conference record of the Fifty-Fifth Asilomar Conference on Signals, Systems & Computers, (2021), pp. 1607-1611
Musical instrument classification is a widely studied topic in Music Information Retrieval (MIR) and signal processing. Applications range from audio database indexing, automatic transcription, and recommender systems to music search by timbre and music annotation. Over the years, many different techniques have been used, including deep neural networks with hand-engineered or learned features. The purpose of this paper is to present Convolutional Neural Network (CNN) based filter banks that not only generate features optimized for classification in the encoded domain, but also achieve near-perfect reconstruction at the decoder output, with quality similar to standard lossy audio codecs. The filter banks are then compared with other commonly used invertible transformations employed as features in classification problems, such as Short-Time Fourier Transform (STFT) spectrograms and Mel spectrograms, using the same simple classifier with a small number of parameters. The idea is that the heavy lifting is done by the learned features rather than by the classifier, while still achieving near-perfect reconstruction.
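The invertibility requirement on such filter banks can be illustrated with a critically sampled block transform: any invertible analysis matrix (here random, standing in for learned CNN filters, and without the overlap and learned structure a real filter bank would have) admits a synthesis side with perfect reconstruction. A minimal numpy sketch under those simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16                                   # filters per block = block length
W = rng.standard_normal((N, N))          # stand-in for *learned* analysis filters

def analyze(x, W):
    """Block transform: split x into length-N frames, apply the filter matrix."""
    frames = x.reshape(-1, W.shape[0])
    return frames @ W.T                  # one coefficient per filter per frame

def synthesize(C, W):
    """Inverse block transform: undo the filters and re-concatenate."""
    frames = C @ np.linalg.inv(W).T
    return frames.reshape(-1)

x = rng.standard_normal(4 * N)           # toy "audio" signal
x_hat = synthesize(analyze(x, W), W)     # perfect reconstruction up to rounding
```

Classification features live in the coefficient domain (`analyze` output), while `synthesize` guarantees the signal can be recovered, which is what separates such transforms from lossy, non-invertible features.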
The method of random directions optimization for stereo audio source separation. - In: Cognitive intelligence for speech processing, (2020), pp. 3316-3320
In this paper, a novel fast time-domain audio source separation technique based on fractional delay filters, with low computational complexity and small algorithmic delay, is presented and evaluated in experiments. Our goal is a Blind Source Separation (BSS) technique applicable to low-cost, low-power devices where processing is done in real time, e.g. hearing aids or teleconferencing setups. The proposed approach optimizes fractional delays, implemented as IIR filters, and attenuation factors between microphone signals to minimize crosstalk, following the principle of a fractional delay-and-sum beamformer. The experiments have been carried out for offline separation with stationary sound sources and for real-time separation with randomly moving sound sources. Experimental results show that the separation performance of the proposed time-domain BSS technique is competitive with State-of-the-Art (SoA) approaches, but with lower computational complexity and without the system delay of frequency-domain BSS.
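A fractional delay implemented as an IIR filter can be sketched with a first-order Thiran-style allpass, whose low-frequency phase delay approximates the desired sub-sample delay; subtracting an attenuated, fractionally delayed copy of the other microphone is then the crosstalk-minimization step described above. This is a simplified single-stage sketch (the paper's actual filter design and optimization loop are not reproduced; names are hypothetical):

```python
import numpy as np

def fractional_delay_allpass(x, d):
    """First-order allpass fractional delay (Thiran-style design):
    y[n] = a*x[n] + x[n-1] - a*y[n-1], with a = (1-d)/(1+d) so the
    low-frequency phase delay is approximately d samples (0 < d < 1)."""
    a = (1.0 - d) / (1.0 + d)
    y = np.zeros_like(x)
    xm1 = ym1 = 0.0
    for n in range(len(x)):
        y[n] = a * x[n] + xm1 - a * ym1
        xm1, ym1 = x[n], y[n]
    return y

def crosstalk_cancel(x1, x2, d, g):
    """Illustrative unmixing step: subtract an attenuated, fractionally
    delayed copy of the interfering channel (delay-and-sum principle)."""
    return x1 - g * fractional_delay_allpass(x2, d)

# a slow sinusoid delayed by 0.3 samples should match the analytic shift
fs, f0, d = 16000, 200.0, 0.3
n = np.arange(1024)
x = np.sin(2 * np.pi * f0 / fs * n)
y = fractional_delay_allpass(x, d)
y_ref = np.sin(2 * np.pi * f0 / fs * (n - d))

# simulated crosstalk: mixing and unmixing use the same delay and gain
s1 = np.sin(2 * np.pi * 150 / fs * n)
s2 = np.sin(2 * np.pi * 420 / fs * n)
m1 = s1 + 0.5 * fractional_delay_allpass(s2, d)
s1_hat = crosstalk_cancel(m1, s2, d, 0.5)
```

Because the allpass is a single recursive section, the delay incurs essentially no block latency, which is the motivation for doing this in the time domain rather than via an STFT.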
Probabilistic optimization for source separation. - In: Conference record of the Fifty-Fourth Asilomar Conference on Signals, Systems & Computers, (2020), pp. 534-538
We present a novel probabilistic zeroth-order optimization method which can handle higher dimensions and can also be used for fast online optimization, for instance for multichannel source separation. We compared it to the Gradientless Descent (GLD) algorithm on a multichannel source separation task and found that our method results in faster and better separation in the two-channel case. For the case of more than two channels, only our method resulted in useful separation. We also applied it to separating sources from 3-dimensional microphone arrays, with comparable results.
Unsupervised interpretable representation learning for singing voice separation. - In: 28th European Signal Processing Conference (EUSIPCO 2020), (2020), pp. 1412-1416
In this work, we present a method for learning interpretable music signal representations directly from waveform signals. Our method can be trained using unsupervised objectives and relies on a denoising auto-encoder model that uses a simple sinusoidal model as its decoding function to reconstruct the singing voice. To demonstrate the benefits of our method, we apply the obtained representations to the task of informed singing voice separation via binary masking, and measure the resulting separation quality by means of the scale-invariant signal-to-distortion ratio. Our findings suggest that our method is capable of learning meaningful representations for singing voice separation, while preserving conveniences of the short-time Fourier transform, such as non-negativity, smoothness, and reconstruction subject to time-frequency masking, that are desired in audio and music source separation.
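The scale-invariant signal-to-distortion ratio used for evaluation has a compact closed form: project the estimate onto the reference (optimal scaling) and compare target energy to residual energy. A minimal sketch with hypothetical names:

```python
import numpy as np

def si_sdr(est, ref):
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB:
    project the estimate onto the reference (optimal scaling alpha),
    then compare target energy to residual energy."""
    alpha = np.dot(est, ref) / np.dot(ref, ref)
    target = alpha * ref
    residual = est - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)                 # toy "voice" reference
est = ref + 0.01 * rng.standard_normal(16000)    # estimate with slight distortion
sdr = si_sdr(est, ref)                           # roughly 40 dB for 1% noise
sdr_scaled = si_sdr(3.0 * est, ref)              # rescaling must not change it
```

The optimal-scaling projection is exactly what makes the metric scale-invariant: multiplying the estimate by any nonzero constant scales target and residual alike, leaving the ratio unchanged.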
Feature-based classification of electric guitar types. - In: Machine learning and knowledge discovery in databases, (2020), pp. 478-484
Fast time domain stereo audio source separation using fractional delay filters. - In: 147th Audio Engineering Society Convention 2019, (2020), pp. 179-183
Comparison of human and machine recognition of electric guitar types. - In: 147th Audio Engineering Society Convention 2019, (2020), pp. 1058-1064
A fast stereo audio source separation for moving sources. - In: Conference record of the Fifty-Third Asilomar Conference on Signals, Systems & Computers, (2019), pp. 1931-1935
Examining the perceptual effect of alternative objective functions for deep learning based music source separation. - In: Conference record of the Fifty-Second Asilomar Conference on Signals, Systems & Computers, (2018), pp. 679-683