NMAH | Smithsonian Speech Synthesis History Project (dk

for frication noise excitation, Fig. 7. These circuits are constraining idealizations and simplifications of the underlying acoustic theory; it remained to be shown whether the new model was capable of synthesizing highly intelligible versions of consonants in various languages of the world.

The designers of the original PAT and OVE disagreed on whether the transfer function of the acoustic tube formed by the vocal tract should be modeled by a set of formant resonators connected in cascade (Fant, 1953, 1956, 1959, 1960) or connected in parallel (Lawrence, 1953; see also Holmes, 1973). The authors were in complete agreement as to the theory (see Flanagan, 1957, for a discussion of the mathematical relations between the two approaches) but disagreed on practical matters concerning whether it was possible to approximate vowel nasalization adequately in a cascade model, or how to avoid peculiarities in the transfer function produced by a parallel configuration when formant amplitude control settings were not perfect. The arguments persist, although at a much more sophisticated level (Holmes, 1983).
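The two arrangements can be sketched in digital form as follows. This is a minimal illustration, not the analog circuitry of PAT or OVE: the function names, the sampling rate, and any formant values or gains passed in are assumptions made for the example, and the two-pole difference equation is a standard digital resonator normalized to unity gain at 0 Hz.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients (a, b, c) of y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    The resonance is centered near freq_hz with bandwidth bw_hz, and
    a is chosen so that the gain at 0 Hz is 1.
    """
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(x, freq_hz, bw_hz, fs_hz):
    """Filter the sample list x through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def cascade(x, formants, fs_hz):
    """Cascade model: the output of each resonator feeds the next one.

    formants is a list of (frequency_hz, bandwidth_hz) pairs. Relative
    formant amplitudes follow from the frequencies and bandwidths alone.
    """
    y = x
    for freq_hz, bw_hz in formants:
        y = resonate(y, freq_hz, bw_hz, fs_hz)
    return y

def parallel(x, formants, gains, fs_hz):
    """Parallel model: every resonator sees the same input.

    Each branch is scaled by its own gain before summing, so the
    overall transfer function depends on how well the gains are set.
    """
    out = [0.0] * len(x)
    for (freq_hz, bw_hz), gain in zip(formants, gains):
        branch = resonate(x, freq_hz, bw_hz, fs_hz)
        out = [o + gain * v for o, v in zip(out, branch)]
    return out
```

For instance, with hypothetical formants `[(500, 60), (1500, 90), (2500, 150)]` at a 10 kHz sampling rate, `cascade` needs no amplitude settings at all, whereas `parallel` requires one gain per formant. That extra set of controls is where the practical disagreement described above arises.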

Modern synthesizers have largely abandoned electronic circuitry in favor of simulation on a digital computer (Gold and Rabiner, 1968) or construction of special-purpose digital hardware. Designs have added subtleties such as the ability to amplitude-modulate the noise in a voiced fricative, reflecting the modulation of the air stream induced by the vibrating vocal folds (Maxey, 1963; Rabiner, 1968), and have added more variable control parameters, but have otherwise not changed greatly (see references cited in Klatt, 1980). Klatt (1972) proposed a hybrid synthesizer that uses cascaded formants (and an extra pole-zero pair for mimicking nasalization) for synthesis of sonorants, and parallel formants (with the same formant frequency values) for synthesis of obstruents. Klatt argued that the quantal theory of consonant place of articulation (Stevens, 1972) could be implemented directly by simple rules in such a synthesizer. The publication of this synthesizer as a Fortran listing (Klatt, 1980) promoted its use for perceptual experimentation in many laboratories, facilitating replication of stimuli and experimental results.
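The amplitude modulation of frication noise in a voiced fricative can be illustrated with a short sketch. It assumes a simple square-wave modulator locked to the fundamental period; the function name, the `depth` parameter, and the choice of uniform white noise are hypothetical and are not taken from any of the cited designs.

```python
import random

def modulated_noise(n_samples, f0_hz, fs_hz, depth=0.5, seed=0):
    """White noise attenuated during the second half of each pitch period.

    depth = 0.0 leaves the noise unmodulated; depth = 1.0 silences it
    completely for half of every fundamental period.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    out = []
    for n in range(n_samples):
        # Position within the current fundamental period, in [0, 1).
        phase = (n * f0_hz / fs_hz) % 1.0
        gain = 1.0 if phase < 0.5 else 1.0 - depth
        out.append(gain * rng.uniform(-1.0, 1.0))
    return out
```

In a full synthesizer this noise would then be shaped by the formant resonators; here only the modulation step is shown.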

An important milestone in the development of speech synthesizers was the demonstration that synthetic speech could be so good that the average listener could not tell the difference between a synthetic and natural sentence when presented with both in sequence (example 8 of the Appendix). The demonstration occurred at the 1972 Boston Speech Communication Conference when John Holmes described a new version of a parallel formant synthesizer (Holmes, 1973). Holmes had spent a winter much earlier working with OVE II to synthesize a good copy of the sentence "I enjoy the simple life" spoken by a man, but had more difficulty with a female utterance (Holmes, 1961) (example 7 of the Appendix). Considering his experience with both cascade and parallel formant models, it is interesting to note that Holmes now much prefers the parallel model shown in Fig. 8 when the objective is to match a natural recording of a particular speaker. His argument, which is somewhat complex, is presented in detail in Holmes (1973, 1983). In essence, he showed that it is desirable to use a voicing waveform based on that of the speaker being modeled. This waveform can be obtained by inverse filtering vowels produced by the speaker to be imitated (the inverse filter, when properly adjusted, cancels the acoustic effects of the vocal tract transfer function). Holmes noted that stylized glottal pulses of the type used in conventional formant synthesizers work nearly as well. After adjusting the frequency
