Proc. of the International Conf. on Spoken Language Processing (ICSLP), Vol. 6, pp. 2803-2806, Sydney, 1998
Speech intelligibility derived from exceedingly sparse spectral information
S. Greenberg, T. Arai and R. Silipo
Abstract: Traditional models of speech assume that a detailed auditory analysis of the short-term acoustic spectrum is essential for understanding spoken language. The validity of this assumption was tested by partitioning the spectrum of spoken sentences into 1/3-octave channels (“slits”) and measuring the intelligibility associated with each channel presented alone and in concert with the others. Four spectral channels, distributed over the speech-audio range (0.3-6 kHz), are sufficient for human listeners to decode sentential material with nearly 90% accuracy, although more than 70% of the spectrum is missing. Word recognition often remains relatively high (60-83%) when just two or three channels are presented concurrently, even though the intelligibility of these same slits, presented in isolation, is less than 9% (Figure 2). Such data suggest that the intelligibility of spoken language is derived from a compound “image” of the modulation spectrum distributed across the frequency spectrum (Figures 1 and 3). Because intelligibility degrades seriously when the slits are desynchronized by more than 25 ms (Figure 4), this compound image is probably derived from both the amplitude and phase components of the modulation spectrum; the result also implies that listeners’ sensitivity to modulation phase is generally “masked” by the redundancy of full-spectrum speech (Figure 5).
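The spectral “slits” described above can be illustrated with a simple band-pass construction. The following is a hypothetical sketch, not the authors’ code: it carves a single 1/3-octave channel (lower and upper edges at fc · 2^(−1/6) and fc · 2^(+1/6)) out of a wideband signal using a Butterworth filter; the 750 Hz center frequency and filter order are illustrative assumptions.

```python
# Hypothetical sketch of a 1/3-octave "slit" (not the authors' implementation).
import numpy as np
from scipy.signal import butter, sosfiltfilt

def third_octave_slit(signal, fs, center_hz):
    """Band-pass `signal` to a 1/3-octave channel centered at `center_hz`."""
    # 1/3-octave band edges lie a sixth of an octave on either side of fc.
    lo = center_hz * 2 ** (-1 / 6)
    hi = center_hz * 2 ** (1 / 6)
    # 4th-order Butterworth band-pass, applied forward-backward so the
    # slit is not delayed relative to the original (zero-phase filtering).
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

fs = 16000                                 # sampling rate, Hz
rng = np.random.default_rng(0)
wideband = rng.standard_normal(fs)         # 1 s of white noise as a stand-in
slit = third_octave_slit(wideband, fs, 750)

# The slit keeps only a narrow band, so most of the wideband energy is gone.
ratio = np.sum(slit ** 2) / np.sum(wideband ** 2)
print(f"energy retained in the slit: {ratio:.3f}")
```

Four such slits at well-separated center frequencies, summed sample-synchronously (or with a deliberate onset offset to model the desynchronization condition), would approximate the stimuli the abstract describes.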