NMAH | Smithsonian Speech Synthesis History Project (dk

Potter et al. (1947) collected sets of spectrograms depicting all of the vowels and consonants of English, and suggested ways to interpret the patterns they observed. They created a terminology that included terms still in use today, such as "stop gap" and "voice bar." In attempting to extract a common property for a stop consonant before different vowels, they defined the concept of the "hub." The "hub" is the ideal value for the second formant in each consonant. According to their observations, the second formant hub was quite useful in distinguishing between consonants having different places of articulation in English (e.g., /b/ vs. /d/ vs. /g/). The authors observed a fairly constant hub for /b/ before different vowels (see examples in Fig. 15) and for /d/, but they said the hub for /g/ was variable across vowel context.

The investigation of the perceptual importance of various acoustic cues to a given phonetic contrast began with the use of the Pattern Playback machine at Haskins Laboratories (Cooper et al., 1951). Delattre, Liberman, Cooper, and their associates created stylized versions of syllables in an effort to determine the acoustic cues sufficient for the synthesis of selected phonetic contrasts. This extensive line of research culminated in a publication suggesting explicit rules for the synthesis of English speech sounds, in which Frances Ingemann collected together a body of "synthesis-by-art" knowledge that was based on experience with the Pattern Playback (Liberman et al., 1959).

The research suggested the importance of formant frequencies, formant frequency motions, spectral peaks in noise bursts, and the relative timing of onsets in different frequency regions as cues for voicing, manner, and place of articulation of consonants. The researchers emphasized the encoded nature of speech (Liberman et al., 1967): the acoustic cues to the identity of a phoneme were spread out in time so as to overlap with cues for adjacent phonemes, and the cues were context dependent. For example, the same plosive burst spectrum was heard as a different consonant depending on the vowel pattern that followed (Cooper et al., 1952). There appeared to be no one invariant acoustic cue signaling the presence of a given stop consonant; rather, the consonantal identity would have to be inferred from the formant transitions into an adjacent vowel. The most interesting descriptive solution to this perceptual paradox was the locus theory (Delattre et al., 1955), which characterized the onset frequency of the second formant motion for a consonant-vowel transition in terms of an invisible consonant locus. The locus was determined by extrapolating backward about 50 ms from observed formant transitions for a given consonant before various vowels (Fig. 16). The importance of a virtual
