NMAH | Smithsonian Speech Synthesis History Project (dk

and amplitude of the voicing source so as to mimic the fluctuations seen in the sentence, Holmes spent a long time carefully adjusting formant frequencies and amplitudes on a trial-and-error basis (see Fig. 9). He found that much of the detailed period-to-period variability in the spectra of natural speech can be mimicked by proper adjustments to the amplitudes of parallel formants, even though we may not yet have a good enough theory and source model to account for all of this natural variation. According to Holmes, the observed irregularities in the spectrum between the formant peaks are of little perceptual importance; only the strong harmonics near a formant peak and below F1 must be synthesized with the correct amplitudes in order to mimic an utterance with a high degree of perceptual fidelity. Holmes also showed that phase relations among harmonics of the voicing source are important for earphone listening, but not when loudspeakers are used and the sound is modified by the reverberation of ordinary room acoustics. The Holmes synthesizer has recently been implemented on a real-time signal processing chip (Quarmby and Holmes, 1984).
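The role of independent formant amplitudes in a parallel configuration can be illustrated with a minimal sketch. It is not Holmes's design: the two-pole resonator difference equation, the plain impulse-train source, the 10 kHz sampling rate, and the formant frequencies, bandwidths, and amplitudes in FORMANTS are all assumptions chosen for illustration.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients (a, b, c) of a two-pole digital resonator.

    Difference equation: y[n] = a*x[n] + b*y[n-1] + c*y[n-2].
    Choosing a = 1 - b - c gives unity gain at 0 Hz.
    """
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(x, freq_hz, bw_hz, fs_hz):
    """Filter the sample list x through one resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for sample in x:
        y = a * sample + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, fs_hz, n_samples):
    """One unit impulse per pitch period (period rounded to whole samples)."""
    period = round(fs_hz / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def parallel_formants(source, formants, fs_hz):
    """Sum resonator branches, each scaled by its own amplitude.

    formants is a list of (freq_hz, bw_hz, amplitude) tuples.
    """
    out = [0.0] * len(source)
    for freq_hz, bw_hz, amp in formants:
        branch = resonate(source, freq_hz, bw_hz, fs_hz)
        for n, value in enumerate(branch):
            out[n] += amp * value
    return out


FS = 10000  # assumed sampling rate in Hz
# Illustrative (frequency Hz, bandwidth Hz, amplitude) values, not measured data.
FORMANTS = [(500.0, 60.0, 1.0), (1500.0, 90.0, 0.5), (2500.0, 120.0, 0.25)]

source = impulse_train(100.0, FS, 2000)
vowel = parallel_formants(source, FORMANTS, FS)
```

Changing one amplitude in FORMANTS rescales only that formant's branch, which is the degree of freedom that the trial-and-error adjustment described above relies on.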

Translation of Holmes's voice-imitating abilities into rules for automatic synthesis of natural voice qualities has not yet been achieved. His parallel synthesizer is clearly up to the job, at least for male voices, so the problem remains one of developing an appropriate theory of control. Of course, it may be that the right theory will suggest a quite different model, such as an articulatory synthesizer.

3. Models of the voicing source

The voicing sound source used in a formant synthesizer has evolved from the simple sawtooth waveforms and filtered impulse trains used in early designs. An impulse train filtered by a two-pole low-pass filter, displayed at the top in Fig. 10, has about the right average spectrum, but the phase of this waveform is wrong. With this waveform, primary excitation of the vocal tract filters occurs at a time corresponding to the instant the folds open, rather than at closure as in natural voicing. Furthermore, the spectrum envelope is perfectly regular (i.e., monotonically
