Workshop on "Computational Audition"
The question that is usually central in Computational Audition is "how to model audition". The question of what to model is addressed less often. Typical answers to the 'what' question are either experimental results or functions that would make useful applications, such as source recognition. But what is the main concern of audition for a living agent?
We asked people who are annoyed by sounds to describe which sounds they like, which they dislike, why some sounds are more annoying than others, and what impact annoying sounds have on their lives. The results indicate a complex interplay of perception and cognition in which a number of basic processes continually search for sonic reassurance (pleasant sounds). If such reassurance cannot be established (e.g., because reassuring sounds are masked by other, unpleasant sounds), conscious processing is tasked with a more elaborate vigilance function that 1) requires a minimal arousal level and 2) forces conscious processing to estimate the (ir)relevance of salient ambient sounds, which distracts from self-selected tasks. This model and the considerations that led to it will be discussed.
This talk will present research that has been conducted at Sheffield under the EPSRC Computational Hearing in Multisource Environments (CHiME) project. One of the objectives of this project has been to explore the potential of techniques inspired by auditory scene analysis to deliver robust speech recognition in 'everyday' listening environments, that is, environments containing multiple competing sound sources mixed in reverberant conditions. The talk will present an ASA-inspired approach to ASR that distinguishes between what Bregman would call 'primitive grouping' and 'schema-driven processes'. The approach treats the foreground and background in an asymmetric manner, and will be contrasted with conventional 'model combination' techniques, which operate by symmetrically decomposing the scene into a superposition of individual sources. The talk will discuss the balance between primitive grouping and schema-driven processing. It will be argued that primitive grouping cues provide constraints that i/ allow the foreground to be reliably interpreted even when the background is unfamiliar, and ii/ are essential in disambiguating complex scenes in which sources have similar temporal and spectral characteristics. It will also be argued that although primitive cues may be largely redundant when dealing with highly familiar or extremely simple acoustic scenes, they may play a crucial role when learning source models in real environments.
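Although the CHiME system itself is not reproduced here, the asymmetric treatment of foreground and background can be illustrated with the time-frequency masks used in missing-data recognition. The sketch below is a generic illustration, not the project's code: the function names and parameter values are my own, and it uses oracle access to the clean signals purely for demonstration. It marks the spectro-temporal cells in which the foreground dominates, so that a missing-data recogniser could treat the remaining cells as unreliable rather than modelling the background explicitly.

```python
import numpy as np

def ideal_binary_mask(foreground, background, threshold_db=0.0):
    """Mark the time-frequency cells in which the foreground dominates.

    A missing-data recogniser can treat the masked-out cells as
    unreliable instead of modelling the background explicitly.
    """
    def spec(x, n_fft=512, hop=256):
        # Magnitude spectrogram via a simple framed, windowed FFT.
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft, hop)]
        return np.abs(np.fft.rfft(np.array(frames), axis=1))

    S_fg, S_bg = spec(foreground), spec(background)
    local_snr_db = 20 * np.log10((S_fg + 1e-12) / (S_bg + 1e-12))
    return local_snr_db > threshold_db  # True = reliable cell

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)                          # stand-in foreground
noise = 0.1 * np.random.default_rng(0).standard_normal(fs)  # stand-in background
mask = ideal_binary_mask(tone, noise)
```

In this toy case the mask is sparse: only the cells around the 440 Hz tone survive, which is exactly the asymmetry the talk describes — the foreground is interpreted from its reliable regions while the unfamiliar background is simply left unmodelled.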
Many sound sources can only be recognised from the pattern of sounds they emit, and not from the individual sound events that make up their emission sequences. We propose a new model of auditory scene analysis, at the core of which is a process that seeks to discover predictable patterns in the ongoing sound sequence. Representations of predictable fragments are created on the fly, and are maintained, strengthened or weakened on the basis of their predictive success. Auditory perceptual organisation emerges from the competition between these representations (auditory proto-objects). Rather than positing a global interaction between all currently active proto-objects, competition is local and occurs between proto-objects that predict the same event at the same time. The model has been evaluated using the auditory streaming paradigm, and provides an intuitive account for many important phenomena including the emergence of, and switching between, alternative organisations, the influence of stimulus parameters on perceptual dominance, switching rate and perceptual phase durations, and the build-up of auditory streaming.
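The core idea — representations strengthened or weakened by their predictive success, with only co-predicting representations in competition — can be caricatured in a few lines. The toy sketch below is far simpler than the actual model, and all dynamics and parameter values are hypothetical; it merely scores competing periodicity hypotheses by how well each predicts the next event in a tone sequence.

```python
def run(sequence, period_hypotheses=(1, 2), gain=0.2, decay=0.05):
    """Score periodicity hypotheses by predictive success.

    Each 'proto-object' predicts that the event seen `p` steps ago
    recurs now. A confirmed prediction strengthens the representation;
    a failed prediction for the same slot weakens it.
    """
    strength = {p: 0.1 for p in period_hypotheses}
    for t, event in enumerate(sequence):
        for p in period_hypotheses:
            if t < p:
                continue  # no prediction available yet
            if sequence[t - p] == event:
                strength[p] += gain                       # confirmed
            else:
                strength[p] = max(strength[p] - decay, 0.0)  # failed
    return strength

# An alternating ABAB... sequence: the period-2 hypothesis accumulates
# strength while the period-1 hypothesis decays away, mirroring the
# emergence of one dominant perceptual organisation.
s = run("ABABABABAB")
```

In the full model, of course, predictions carry timing as well as identity, and switching arises from the ongoing competition rather than from a single winner; the sketch only illustrates the success-driven bookkeeping.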
Martin Heckmann, Xavier Domont, Samuel Ngouoko, Honda Research Institute Europe GmbH
In this presentation I will describe a hierarchical framework for the extraction of spectro-temporal acoustic features. The features are designed for higher robustness in dynamic environments. Motivated by the large gap between human and machine performance in such conditions, we take inspiration from the organization of the mammalian auditory cortex in the design of our features. This includes the joint processing of spectral and temporal information, organization in hierarchical layers, competition between coequal features, the use of high-dimensional sparse feature spaces, and the learning of the underlying receptive fields in a data-driven manner. Owing to these properties, we term the features Hierarchical Spectro-Temporal (HIST) features. I will demonstrate, via recognition results obtained in different environments, that these features deliver information complementary to conventional spectral features and that this information improves recognition performance. Furthermore, I will highlight how a discriminant sub-space projection of the features can be used to further improve performance, and how the features can adapt to different noise scenarios via an adaptive feature competition.
The auditory peripheral efferent system consists of both the acoustic reflex (AR) and the medial olivo-cochlear (MOC) reflex. There is, as yet, no generally agreed view on the signal-processing role of this system, but it has been widely predicted to play a beneficial role in speech perception in addition to its obvious protective functions. To study this, we have extended an existing computational model of the physiology of the auditory periphery and examined its response to speech stimuli both in quiet and in the presence of interfering talkers. Our measure of intelligibility benefit is based on automatic speech recognition performance, using the model as a front end to a standard recogniser. The model results clearly show better performance when the MOC and AR are active and the signal-to-noise ratio is positive. This supports the idea that these systems contribute to intelligibility in normal human listening conditions. While the recognition performance of the model is inferior to human performance, it is surprisingly robust in the presence of noise compared to many other recognition algorithms. The problems experienced by hearing-impaired listeners in noisy environments will be considered in this context.
Although it is convenient to think of the processes of segregating objects and of selecting objects as distinct and separable stages that are engaged when listeners process complex auditory scenes, such a view is overly simplistic. While selective attention to a sound source requires that the desired source is segregated from a scene, the very act of preparing to listen for a source with a particular attribute (e.g., from a particular location) causes changes in how subsequent inputs are processed, and thus how the scene is analyzed. Moreover, the dynamics of how a scene is parsed are complex and depend upon both volitional attention and automatic processes. In particular, once a given stream of sound is the focus of attention, subsequent sound elements that are perceptually similar are perceptually enhanced, and at least a part of this enhancement is obligatory, rather than volitional. Both psychophysical and neuroimaging evidence in support of these ideas will be reviewed, with an eye toward how such principles might be incorporated into models of auditory scene analysis.
A fair amount of work on understanding sound mixtures is based on learned low-rank decompositions of spectra. Such decompositions are used to analyze mixtures, extract constituent sounds and then perform various subsequent tasks such as pitch tracking, speech recognition, etc. These tasks however are almost always hindered by the quality of the separation step which is not guaranteed to facilitate them. If instead of low-rank models we try to explain mixtures by using verbatim bits from training data, it is possible to not only improve the performance of the separation, but to also carry semantic information from these bits to the mixtures and automatically perform otherwise complex parameter estimation tasks. I will show how such a model is easily computed as a sparse decomposition, how it can be enhanced with various priors, and how it can be used in the context of problems that include mixtures.
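As a rough illustration of the exemplar idea, the sketch below decomposes a mixture spectrum as a sparse nonnegative combination of verbatim training frames. This is a generic KL-divergence multiplicative update with an L1 penalty, with hypothetical names and toy data — not necessarily the presenter's algorithm — but it shows how, once the weights are found, any label attached to a training frame travels with it into the mixture.

```python
import numpy as np

def exemplar_decompose(mixture_spec, dictionary, n_iter=200, sparsity=0.1):
    """Sparse nonnegative weights over verbatim training frames.

    `dictionary` holds stored magnitude spectra as columns; the weights
    are fitted with KL-divergence multiplicative updates plus an L1
    penalty that encourages using only a few exemplars.
    """
    D = dictionary                              # (n_freq, n_exemplars)
    w = np.full(D.shape[1], 1.0 / D.shape[1])
    eps = 1e-12
    for _ in range(n_iter):
        recon = D @ w + eps
        w *= (D.T @ (mixture_spec / recon)) / (D.sum(axis=0) + sparsity + eps)
    return w

# Toy data: 10 stored exemplar spectra, a mixture built from two of them.
rng = np.random.default_rng(0)
D = np.abs(rng.standard_normal((64, 10)))
true_w = np.zeros(10)
true_w[[2, 7]] = [1.0, 0.5]
mix = D @ true_w
w = exemplar_decompose(mix, D)
rel_err = np.linalg.norm(D @ w - mix) / np.linalg.norm(mix)
```

Because the atoms are whole training frames rather than learned low-rank bases, the recovered weights point directly back at specific training items — this is what lets semantic information ride along with the decomposition.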
Understanding the auditory scene requires that spatial information be extracted from and combined across multiple acoustic cues, such as interaural differences of time (ITD) and level (ILD). In real-world listening, the magnitude and reliability of those cues vary dynamically over the course of brief sounds, and behavioral studies suggest that listeners’ sensitivity to the cues varies accordingly. This presentation reviews several lines of investigation into the temporal dynamics of listeners’ sensitivity to ITD and ILD, and the neural mechanisms that might give rise to dynamic effects. Overall results suggest the involvement of an onset-specific mechanism in the early binaural processing of both cues, and a subsequent temporally integrative mechanism involved in ILD but not ITD processing. That difference will be discussed in terms of the weighting of spatial information across cues in complex and reverberant auditory scenes. (Supported by NIH R03 DC009482)
D. Tollin - Neural sensitivity to interaural level differences determines virtual acoustic space minimum audible angles for single neurons in the lateral superior olive
Because the peripheral receptors of the ear have no mechanism to directly sense sound location on their own (unlike the topographic organization of the retina), location must be computed at more central levels. This makes sound localization a fascinating neuro-computational problem, particularly from a developmental perspective. The minimum audible angle (MAA), the smallest angle separating two sound sources that can be reliably discriminated, is a psychophysical measure of spatial acuity. In humans and cats, MAAs for tone and noise stimuli range from 1-5°. For high-frequency (>1.5 kHz) tones the predominant cue for azimuth is the interaural level difference (ILD). Neurophysiologically, ILDs are first encoded in the lateral superior olive (LSO). Here, we examined the ability of LSO neurons in cats to signal changes in the “azimuth” of noise sources. Using measurements of acoustical head-related transfer functions, the virtual acoustic space technique was used to manipulate source azimuth in the physiological experiments. For each neuron, signal detection theory was used to compute the smallest increment in azimuth necessary to discriminate that change based on discharge rate and its associated response variability. Minimum neural MAAs were as small as 2.3° for midline sources (median = 4.5°, n = 32 neurons). The good neural acuity for spatial location will be explained in terms of the changes in the frequency-specific acoustic ILD cue with changes in source location, along with the underlying sensitivity of the LSO neurons to the acoustical ILD cue itself. The results demonstrate that LSO neurons can signal changes in sound azimuth that match or exceed behavioral capabilities.
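The signal-detection analysis behind a "neural MAA" can be sketched in a few lines. The rate-azimuth function below is hypothetical, and the d' criterion of 1.0 is a common convention rather than necessarily the one used in this work; the sketch simply finds the smallest azimuth change whose rate difference is discriminable given the response variability.

```python
import numpy as np

def neural_maa(azimuths, mean_rates, sd_rates, ref_idx=0, criterion=1.0):
    """Smallest azimuth change from the reference that is discriminable
    at d' >= criterion, given rate means and standard deviations."""
    r0, s0 = mean_rates[ref_idx], sd_rates[ref_idx]
    for i in range(ref_idx + 1, len(azimuths)):
        # d' compares the rate change against the pooled variability.
        d_prime = abs(mean_rates[i] - r0) / np.sqrt((sd_rates[i]**2 + s0**2) / 2)
        if d_prime >= criterion:
            return azimuths[i] - azimuths[ref_idx]
    return None  # no discriminable change within the tested range

# Hypothetical single neuron: rate rises 2 spikes/s per degree, SD = 5.
az = np.arange(0, 21, 1)                  # degrees from midline
rates = 20 + 2.0 * az                     # spikes/s
sds = np.full(len(az), 5.0)
maa = neural_maa(az, rates, sds)          # → 3 (degrees)
```

With these toy numbers the neuron needs a 3° change before d' reaches 1 — a steeper rate-azimuth slope or lower response variability would shrink the neural MAA accordingly.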
R. Turner - Decomposing signals into a sum of amplitude and frequency modulated sinusoids using probabilistic inference
In this talk I will discuss a new method for decomposing a signal into a sum of amplitude- and frequency-modulated sinusoids. Such representations are often used for producing stimuli for scientific experiments, e.g., vocoded signals, chimeric sounds, or sinusoidal speech. Moreover, the representation can form the most primitive stage of a computational auditory scene analysis system.
Classically, there are two main ways to realise such a decomposition. The first is subband demodulation where a signal is first passed through a filter bank, before being demodulated using, for example, the Hilbert Transform. The second approach is sinusoidal modelling which uses a set of heuristics to track the sinusoidal components present in a signal (e.g. the McAulay-Quatieri algorithm).
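For reference, the first classical approach — subband demodulation — can be sketched in a few lines for a single band. The filter order, band edges and test signal below are arbitrary choices for illustration, not parameters from the talk.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def subband_envelope(x, fs, f_lo, f_hi, order=4):
    """Classical subband demodulation: band-pass filter the signal,
    then take the magnitude of the analytic signal (Hilbert envelope)."""
    sos = butter(order, [f_lo, f_hi], btype='bandpass', fs=fs, output='sos')
    band = sosfiltfilt(sos, x)          # zero-phase band-pass filtering
    return np.abs(hilbert(band))        # amplitude envelope of the band

fs = 16000
t = np.arange(fs) / fs
# A 1 kHz carrier with 8 Hz amplitude modulation; the band around the
# carrier should recover the 1 + 0.5*sin(2*pi*8*t) modulator.
x = (1 + 0.5 * np.sin(2 * np.pi * 8 * t)) * np.sin(2 * np.pi * 1000 * t)
env = subband_envelope(x, fs, 800, 1200)
```

Running one such demodulator per filterbank channel gives the full subband representation; the artifacts mentioned later arise because the filterbank and the Hilbert operator impose the decomposition rather than inferring it from the signal.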
I will introduce a new approach that uses probabilistic inference to realise the representation. I will show that the new method has several advantages over the classical approaches: it suffers fewer artifacts and handles noise and missing data naturally. The main drawback is that it is computationally more expensive than the classical approaches.
This is joint work with Maneesh Sahani.
Source separation aims to extract the signals of individual sound sources from a given mixture. It is one of the hottest topics in audio signal processing, with applications ranging from speech enhancement and robust speech recognition to 3D upmixing and post-production of music. In this talk, we will present the general probabilistic variance modeling framework and discuss its advantages over earlier approaches such as ICA, SCA, GMM or NMF. We will show how cues as diverse as harmonicity, timbre, temporal fine structure, spatial location and spatial spread can be jointly exploited by means of hierarchical source models and probabilistic priors. We will illustrate the resulting separation quality via a number of sound examples from the Signal Separation Evaluation Campaign (SiSEC).
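Variance-model approaches ultimately estimate each source by Wiener filtering with its share of the total variance. The minimal single-channel sketch below uses hypothetical names and toy data — the hierarchical source models and priors discussed in the talk would be used to produce the variance estimates that are simply assumed here.

```python
import numpy as np

def wiener_separate(mix_stft, var_sources):
    """Gaussian variance-model separation: each source's STFT is
    recovered by Wiener filtering with its fraction of the total
    variance in every time-frequency cell."""
    total = sum(var_sources) + 1e-12            # avoid division by zero
    return [v / total * mix_stft for v in var_sources]

# Toy mixture STFT and two (assumed known) source variance maps.
rng = np.random.default_rng(1)
mix = rng.standard_normal((128, 50)) + 1j * rng.standard_normal((128, 50))
v1 = rng.random((128, 50))
v2 = rng.random((128, 50))
est1, est2 = wiener_separate(mix, [v1, v2])
```

A useful property of this decomposition is that it is conservative: the source estimates sum back to the mixture exactly, so all of the modelling effort goes into estimating the variances, not into reassembling the scene.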