Syllable-based speech recognition using auditorylike features

Journal of the Acoustical Society of America, Vol. 105, No. 2, Pt. 2, pp. 1157-1158, 1999

S. Greenberg, T. Arai, B. Kingsbury, N. Morgan, M. Shire, R. Silipo and S. Wu

Abstract: Classical models of speech recognition (both human and machine) assume that a detailed, short-term analysis of the signal is essential for accurate decoding of spoken language via a linear sequence of phonetic segments. This classical framework is incommensurate with quantitative acoustic/phonetic analyses of spontaneous discourse (e.g., the Switchboard corpus for American English). Such analyses indicate that the syllable, rather than the phone, is likely to serve as the representational interface between sound and meaning, providing a relatively stable representation of lexically relevant information across a wide range of speaking and acoustic conditions. The auditory basis of this syllabic representation appears to be derived from the low-frequency (2-16 Hz) modulation spectrum, whose temporal properties correspond closely to the distribution of syllabic durations observed in spontaneous speech. Perceptual experiments confirm the importance of the modulation spectrum for understanding spoken language and demonstrate that the intelligibility of speech is derived from both the amplitude and phase components of this spectral representation. Syllable-based automatic speech recognition systems, currently under development, have proven useful under various acoustic conditions representative of the real world (such as reverberation and background noise) when used in conjunction with more traditional, phone-based recognition systems.
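
As an illustration of the low-frequency modulation spectrum mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes an amplitude envelope has already been extracted and sampled at a fixed frame rate, takes the magnitude of the envelope's discrete Fourier transform (discarding phase for simplicity), and measures how much of the modulation energy falls in the 2-16 Hz band. The function names, the 100 Hz frame rate, and the synthetic 4 Hz test envelope are all assumptions chosen for the example.

```python
import cmath
import math


def modulation_spectrum(envelope, frame_rate_hz):
    """Return (freqs_hz, magnitudes) for the one-sided DFT of an envelope.

    The mean is removed first so the DC term does not dominate.
    """
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [x - mean for x in envelope]
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        # Direct DFT; O(n^2), which is fine for a short illustrative signal.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        freqs.append(k * frame_rate_hz / n)
        mags.append(abs(coeff))
    return freqs, mags


def band_energy_fraction(freqs, mags, lo_hz=2.0, hi_hz=16.0):
    """Fraction of modulation energy between lo_hz and hi_hz (inclusive)."""
    total = sum(m * m for m in mags)
    if total == 0.0:
        return 0.0
    band = sum(m * m for f, m in zip(freqs, mags) if lo_hz <= f <= hi_hz)
    return band / total


def peak_frequency(freqs, mags):
    """Frequency of the largest-magnitude modulation component."""
    best = max(range(len(mags)), key=mags.__getitem__)
    return freqs[best]


# Hypothetical test signal: a 2 s envelope sampled at 100 Hz whose energy
# fluctuates 4 times per second (an assumed, syllable-like rate).
FRAME_RATE_HZ = 100
envelope = [
    1.0 + 0.5 * math.sin(2 * math.pi * 4.0 * i / FRAME_RATE_HZ)
    for i in range(2 * FRAME_RATE_HZ)
]

freqs, mags = modulation_spectrum(envelope, FRAME_RATE_HZ)
peak_hz = peak_frequency(freqs, mags)
fraction = band_energy_fraction(freqs, mags)
```

For this synthetic envelope the modulation energy is concentrated at a single component inside the 2-16 Hz band. A real envelope would typically be computed per frequency channel from recorded speech, which this sketch does not attempt.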
