NMAH | Smithsonian Speech Synthesis History Project (dk

devices suggests that either something is still missing from the voicing source models, or that we do not yet know how to control them properly.

A number of recent glottal waveform models produce source spectra that include zeros (see Fujisaki, 1986, for a review). Flanagan (1972, pp. 232-245) describes the expected locations of voicing source spectral zeros as a function of various assumptions about the shape of the glottal volume velocity waveform. Many different waveshapes imply the existence of zeros; the only requirement is that there be well-defined opening and closing times. If a source spectral zero lies close in frequency to a formant, the formant will be reduced in amplitude or even completely obliterated. Source spectral zeros are present in the glottal waveform models of Fant et al. (1985) and in Klattalk, but the depth of the spectral notches is only a few decibels. Flanagan shows that the frequency locations and depths of the spectral notches induced by source zeros are sensitive to relatively small changes in critical aspects of the source waveform, such as symmetry. It may be that the dull, lifeless quality of synthetic voices is due in part to the absence of small period-to-period changes in the zero pattern. Holmes (1973) was able to synthesize a nearly perfect imitation of a male voice without resorting to this level of detail in modeling the source, but he may have mimicked the most important effects of source changes by ensuring that the amplitudes of individual formant spectral peaks followed changes observed in the natural utterance.
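The sensitivity of source zeros to waveform symmetry can be illustrated with a minimal numerical sketch. It assumes an idealized triangular glottal pulse, chosen here only because its spectrum is easy to reason about; it is not the Fant et al. (1985) model or the Klattalk source, and the function names, the 4 ms open phase, and the 10 kHz sampling rate are all hypothetical choices made for illustration.

```python
import cmath
import math


def triangular_pulse(open_ms, symmetry, fs):
    """Return one sampled pulse that rises linearly, then falls linearly.

    open_ms  -- duration of the open phase in milliseconds (assumed value)
    symmetry -- fraction of the open phase spent opening (0.5 = symmetric)
    fs       -- sampling rate in Hz
    """
    n_open = round(open_ms * 1e-3 * fs)
    n_rise = round(symmetry * n_open)
    if not 0 < n_rise < n_open:
        raise ValueError("symmetry must leave room for both rise and fall")
    pulse = []
    for n in range(n_open + 1):
        if n <= n_rise:
            pulse.append(n / n_rise)                        # opening phase
        else:
            pulse.append((n_open - n) / (n_open - n_rise))  # closing phase
    return pulse


def magnitude_db(pulse, freq_hz, fs):
    """Spectrum magnitude of the pulse at freq_hz, in dB relative to DC."""
    w = 2.0 * math.pi * freq_hz / fs
    x = sum(s * cmath.exp(-1j * w * n) for n, s in enumerate(pulse))
    dc = sum(pulse)
    # Floor the magnitude so that an exact zero does not break the logarithm.
    return 20.0 * math.log10(max(abs(x), 1e-12) / dc)


if __name__ == "__main__":
    fs = 10000
    for symmetry in (0.5, 0.6, 0.7):
        pulse = triangular_pulse(4.0, symmetry, fs)
        level = magnitude_db(pulse, 500.0, fs)
        print(f"symmetry {symmetry:.1f}: level at 500 Hz = {level:.1f} dB")
```

Under these assumptions the symmetric pulse has a spectral zero at 500 Hz (twice the reciprocal of the 4 ms open phase), so a formant near that frequency would be strongly attenuated. Shifting the peak of the pulse so that opening takes 70 percent of the open phase fills the notch in almost entirely, which is one simple way to see why small period-to-period changes in pulse shape could alter the zero pattern.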

Naturalness is a particular problem when trying to synthesize a convincing imitation of a female voice (Carrell, 1984). Simple scaling procedures [formants multiplied by a factor of 1.15 (Peterson and Barney, 1952), fundamental frequency by a factor of 1.7, glottal open quotient slightly greater than for a male voice] do not result in a particularly female voice quality (example 9 of the Appendix). The glottal source model is not quite right; nonuniform formant scaling appears to be required (Fant, 1975), and it may also be
