|KLATT 1987, p. 743|
for frication noise excitation, Fig. 7. These circuits represent constraining idealizations/simplifications compared with underlying acoustic theory; it remained to be shown whether the new model was capable of synthesizing highly intelligible versions of consonants in various languages of the world.
The designers of the original PAT and OVE disagreed on whether the transfer function of the acoustic tube formed by the vocal tract should be modeled by a set of formant resonators connected in cascade (Fant, 1953, 1956, 1959, 1960) or connected in parallel (Lawrence, 1953; see also Holmes, 1973). The authors were in complete agreement as to the theory (see Flanagan, 1957, for a discussion of the mathematical relations between the two approaches) but disagreed on practical matters concerning whether it was possible to approximate vowel nasalization adequately in a cascade model, or how to avoid peculiarities in the transfer function produced by a parallel configuration when formant amplitude control settings were not perfect. The arguments persist, although at a much more sophisticated level (Holmes, 1983).
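The difference between the two configurations can be made concrete with a small digital sketch. It is an illustration only, not a reconstruction of PAT, OVE, or any published synthesizer. It assumes the standard two-pole digital resonator, y[n] = a x[n] + b y[n-1] + c y[n-2], with coefficients set from a formant frequency and bandwidth and scaled for unity gain at 0 Hz. The function names, the example formant values, and the simple summation in the parallel branch are assumptions made for illustration; practical parallel designs add refinements (such as alternating branch polarity) that are omitted here.

```python
import math

def resonator_coeffs(f_hz, bw_hz, fs_hz):
    """Coefficients (a, b, c) of y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * f_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c

def resonate(x, coeffs, gain=1.0):
    """Run one resonator over the samples in x, scaling its output by gain."""
    a, b, c = coeffs
    y1 = y2 = 0.0  # previous two (unscaled) output samples
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(gain * y)
        y2, y1 = y1, y
    return out

def cascade(x, formants, fs_hz):
    """Cascade: each resonator filters the output of the previous one.
    Relative formant amplitudes follow from frequencies and bandwidths alone."""
    y = list(x)
    for f_hz, bw_hz in formants:
        y = resonate(y, resonator_coeffs(f_hz, bw_hz, fs_hz))
    return y

def parallel(x, formants, gains, fs_hz):
    """Parallel: every resonator filters the same input and the branches are
    summed. Each formant amplitude is an explicit control (gains)."""
    out = [0.0] * len(x)
    for (f_hz, bw_hz), g in zip(formants, gains):
        branch = resonate(x, resonator_coeffs(f_hz, bw_hz, fs_hz), g)
        out = [o + v for o, v in zip(out, branch)]
    return out

# Hypothetical vowel-like formant set: (frequency in Hz, bandwidth in Hz).
EXAMPLE_FORMANTS = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)]
EXAMPLE_FS = 10000.0

impulse = [1.0] + [0.0] * 255
cascade_response = cascade(impulse, EXAMPLE_FORMANTS, EXAMPLE_FS)
parallel_response = parallel(impulse, EXAMPLE_FORMANTS, [1.0, 0.5, 0.25], EXAMPLE_FS)
```

In the cascade function no amplitude parameters appear at all, whereas the parallel function requires one gain per formant. This is the practical trade-off at issue: the parallel form offers more freedom, but its transfer function is only as good as the gain settings supplied.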
Modern synthesizers have largely abandoned electronic circuitry in favor of simulation on a digital computer (Gold and Rabiner, 1968) or construction of special-purpose digital hardware. Designs have added subtleties such as an ability to amplitude-modulate the noise in a voiced fricative due to the modulation of the air stream induced by the vibrating vocal folds (Maxey, 1963; Rabiner, 1968), and have added more variable control parameters, but have otherwise not changed greatly (see references cited in Klatt, 1980). Klatt (1972) proposed a hybrid synthesizer that uses cascaded formants (and an extra pole-zero pair for mimicking nasalization) for synthesis of sonorants, and parallel formants (with the same formant frequency values) for synthesis of obstruents. Klatt argued that the quantal theory of consonant place of articulation (Stevens, 1972) could be implemented directly by simple rules in such a synthesizer. The publication of this synthesizer as a Fortran listing (Klatt, 1980) promoted its use for perceptual experimentation in many laboratories, facilitating replication of stimuli and experimental results.
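As a rough illustration of the amplitude modulation of frication noise mentioned above, the following sketch scales a random noise source by an envelope that repeats once per glottal period. It is not taken from any of the cited designs: the square-wave envelope, the 50% default modulation depth, the uniform noise source, and all function and parameter names are assumptions chosen for simplicity.

```python
import random

def modulated_noise(n_samples, fs_hz, f0_hz, depth=0.5, seed=0):
    """White noise whose amplitude is reduced during the second half of each
    glottal period (assumed square-wave envelope).

    depth=0.0 gives unmodulated noise; depth=1.0 silences the second half.
    """
    rng = random.Random(seed)      # seeded so the sketch is repeatable
    period = fs_hz / f0_hz         # glottal period in samples (may be fractional)
    out = []
    for n in range(n_samples):
        phase = (n % period) / period            # position within the period, 0..1
        envelope = 1.0 if phase < 0.5 else 1.0 - depth
        out.append(envelope * rng.uniform(-1.0, 1.0))
    return out

# Hypothetical settings: 10 kHz sampling rate, 100 Hz fundamental frequency.
voiced_fricative_noise = modulated_noise(1000, 10000.0, 100.0)
```

For a voiceless fricative the same source would be used with depth set to 0, so that the noise is left unmodulated.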
An important milestone in the development of speech synthesizers was the demonstration that synthetic speech could be so good that the average listener could not tell the difference between a synthetic and natural sentence when presented with both in sequence (example 8 of the Appendix). The demonstration occurred at the 1972 Boston Speech Communication Conference when John Holmes described a new version of a parallel formant synthesizer (Holmes, 1973). Holmes had spent a winter much earlier working with OVE II to synthesize a good copy of the sentence "I enjoy the simple life" spoken by a man, but had more difficulty with a female utterance (Holmes, 1961) (example 7 of the Appendix). Considering his experience with both cascade and parallel formant models, it is interesting to note that Holmes now much prefers the parallel model shown in Fig. 8 when the objective is to match a natural recording of a particular speaker. His argument, which is somewhat complex, is presented in detail in Holmes (1973, 1983). In essence, he showed that it is desirable to use a voicing waveform based on that of the speaker being modeled. This waveform can be obtained by inverse filtering vowels produced by the speaker to be imitated (the inverse filter, when properly adjusted, cancels the acoustic effects of the vocal tract transfer function). Holmes noted that stylized glottal pulses of the type used in conventional formant synthesizers work nearly as well. After adjusting the frequency