Combining binaural and cortical features for robust speech recognition
Constantin Spille, Birger Kollmeier and Bernd T. Meyer (2017)
IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(4), 756–767, April 2017
The segregation of concurrent speakers and other sound sources is an important ability of the human auditory system, but it is missing from most current systems for automatic speech recognition (ASR), resulting in a large gap between human and machine performance. This study combines processing related to peripheral and cortical stages of the auditory pathway: First, a physiologically motivated binaural model estimates the positions of moving speakers in order to enhance the desired speech signal. Second, the signals are converted to spectro-temporal Gabor features, which resemble cortical speech representations and have been shown to improve ASR in noisy conditions. Spectro-temporal Gabor features improve recognition results in all acoustic conditions under consideration compared to Mel-frequency cepstral coefficients (MFCCs). Binaural processing results in lower word error rates (WERs) in acoustic scenes with a concurrent speaker, whereas monaural processing should be preferred in the presence of a stationary masking noise. An in-depth analysis of the binaural processing identifies crucial processing steps, such as the localization of sound sources and the estimation of the beamformer's noise coherence matrix, and shows how strongly each step affects recognition performance in acoustic conditions of varying complexity.
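To illustrate the kind of spectro-temporal Gabor feature extraction the abstract refers to, the following is a minimal sketch: a 2-D Gabor filter (a cosine carrier under a Hann envelope) is convolved with a log-Mel spectrogram. All filter sizes and modulation frequencies here are illustrative placeholders, not the actual filterbank parameters used in the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_filter(nf=15, nt=15, wf=0.25, wt=0.25):
    """Build a 2-D spectro-temporal Gabor filter.

    nf, nt: filter extent in frequency channels and time frames.
    wf, wt: spectral and temporal modulation frequencies (rad/sample).
    (Illustrative defaults; not the paper's filterbank parameters.)
    """
    f = np.arange(nf) - nf // 2
    t = np.arange(nt) - nt // 2
    envelope = np.outer(np.hanning(nf), np.hanning(nt))
    carrier = np.cos(wf * f[:, None] + wt * t[None, :])
    g = envelope * carrier
    return g - g.mean()  # zero mean, so constant offsets are ignored

def gabor_features(log_mel_spec, filters):
    """Filter a (freq x time) log-Mel spectrogram with each Gabor filter;
    'same'-mode convolution keeps the time-frequency resolution."""
    return np.stack([convolve2d(log_mel_spec, g, mode="same")
                     for g in filters])

# Usage sketch: a bank of three spectral modulation frequencies
spec = np.random.default_rng(1).standard_normal((23, 100))
bank = [gabor_filter(wf=w) for w in (0.1, 0.25, 0.5)]
feats = gabor_features(spec, bank)  # shape (3, 23, 100)
```

In a real front end, the filter outputs (one per filter and time-frequency point) would be subsampled and concatenated into the feature vector passed to the recognizer.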
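The abstract highlights the estimation of the beamformer's noise coherence matrix as a crucial step. As a hedged sketch of why that matrix matters, here is the standard MVDR (minimum variance distortionless response) weight formula, w = R⁻¹d / (dᴴR⁻¹d), where R is the noise coherence/covariance matrix and d the steering vector toward the estimated speaker position. This is a generic textbook formulation, not necessarily the exact beamformer variant used in the paper.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR beamformer weights: w = R^{-1} d / (d^H R^{-1} d).

    noise_cov: (M, M) Hermitian noise coherence matrix for M microphones.
    steering:  (M,) complex steering vector toward the target direction.
    The result passes the target direction undistorted (w^H d = 1)
    while minimizing output noise power.
    """
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

# Usage sketch with a synthetic 4-mic noise coherence matrix
M = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + M * np.eye(M)          # Hermitian positive definite
d = np.exp(-1j * np.pi * 0.3 * np.arange(M))  # plane-wave steering vector
w = mvdr_weights(R, d)                      # w.conj() @ d == 1
```

Errors in either input degrade the beamformer, which is consistent with the paper's finding that source localization (which determines d) and coherence-matrix estimation (which determines R) both strongly affect recognition performance.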