|KLATT 1987, p. 742|
were copied directly onto the transparent belt. The same words were about 85% intelligible after the spectrographic patterns had been schematized according to hypotheses about the most important aspects of observed patterns (Cooper et al., 1951). Important early discoveries at Haskins are discussed in a later section.
1. The source-filter theory of speech generation
The Voder and Pattern Playback were methods for copying the time-varying spectral patterns of speech. A critical next step in the history of speech synthesis was the development of an acoustic theory of how speech is produced (summarized in Fant, 1960) and the design of formant and articulatory synthesizers based on this theory. The acoustic theory of speech production, in its simplest form, states that it is possible to view speech as the outcome of the excitation of a linear filter by one or more sound sources. The primary sources of sound are voicing, caused by the vibration of the vocal folds, and turbulence noise caused by a pressure difference across a constriction. The linear filter simulates the resonance effects of the acoustic tube formed by the pharynx, oral cavity, and lips. This vocal tract transfer function can be modeled by a set of poles -- each complex conjugate pair of poles producing a local peak in the spectrum, known as a formant. At times the representation of the vocal tract transfer function in terms of a product of poles has to be augmented with zeros (antiresonators) to model the sound absorbing properties of side-branch tubes in complex articulations such as nasals, nasalized vowels, and fricatives (Fant, 1960).
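The source-filter view can be illustrated with a minimal sketch. It is not taken from any synthesizer described here. It assumes an impulse train as a crude stand-in for the voicing source, and a commonly used two-pole digital resonator for each formant; the sample rate, formant frequencies, bandwidths, and all function names are illustrative assumptions.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients of a two-pole resonator (one complex-conjugate pole pair)."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    return a, b, c

def resonate(signal, freq_hz, bw_hz, fs_hz):
    """Apply y[n] = a*x[n] + b*y[n-1] + c*y[n-2] to a list of samples."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, fs_hz, n_samples):
    """Crude voicing source: one unit pulse per fundamental period."""
    period = int(round(fs_hz / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def magnitude_at(freq_hz, formant_hz, bw_hz, fs_hz):
    """Magnitude response of a single resonator, evaluated at freq_hz."""
    a, b, c = resonator_coeffs(formant_hz, bw_hz, fs_hz)
    w = 2.0 * math.pi * freq_hz / fs_hz
    z1 = complex(math.cos(w), -math.sin(w))  # e^{-jw}
    return abs(a / (1.0 - b * z1 - c * z1 * z1))

# Illustrative values only (assumptions, not from the text).
FS = 10000
FORMANTS = [(500, 60), (1500, 90), (2500, 150)]  # (frequency, bandwidth) in Hz

source = impulse_train(100, FS, 1000)
speech = source
for freq, bw in FORMANTS:
    # Passing the signal through each resonator in turn multiplies the
    # pole pairs together, i.e. a transfer function that is a product of poles.
    speech = resonate(speech, freq, bw, FS)
```

Each pass through `resonate` adds one spectral peak; `magnitude_at` can be used to check that a resonator's response is larger near its formant frequency than at neighboring frequencies.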
2. Models of the vocal tract transfer function
Some speech synthesizers based on this acoustic theory use both poles (formant resonators) and zeros (antiformants) to model the vocal tract transfer function, while other models have tried to avoid the necessity of zeros. It has been argued that spectral notches caused by transfer function zeros are hard to detect auditorily (Malme, 1959), and therefore that the primary acoustic/perceptual effect of a zero is its influence on the amplitude of any nearby formant peak. If this assumption is true, then one may not require zero circuits in a synthesizer, as long as it is possible to adjust the amplitudes of formant peaks appropriately based on a knowledge of where the zeros of the transfer function should be. This simplification has led to a parallel formant synthesizer as one popular method for modeling the vocal tract transfer function. The outputs of a set of resonators connected in parallel are summed, and the input sound source amplitude of each formant resonator is determined by an independent control parameter.
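The parallel arrangement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reconstruction of any particular synthesizer: the two-pole digital resonator, the formant values, and the names `resonator` and `parallel_formant_filter` are all hypothetical choices made for the example.

```python
import math

def resonator(signal, freq_hz, bw_hz, fs_hz):
    """Two-pole digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def parallel_formant_filter(source, formants, amplitudes, fs_hz):
    """Sum the outputs of resonators fed by independently scaled copies of the source."""
    out = [0.0] * len(source)
    for (freq_hz, bw_hz), amp in zip(formants, amplitudes):
        # Each branch has its own input amplitude control.
        branch = resonator([amp * x for x in source], freq_hz, bw_hz, fs_hz)
        out = [o + y for o, y in zip(out, branch)]
    return out

# Illustrative values only (assumptions, not from the text).
FS = 10000
FORMANTS = [(500, 60), (1500, 90), (2500, 150)]  # (frequency, bandwidth) in Hz
pulse = [1.0] + [0.0] * 199  # a single excitation pulse

# Lowering the second amplitude weakens the second formant peak, which is how
# a parallel design can imitate the effect of a nearby zero without a zero circuit.
vowel_like = parallel_formant_filter(pulse, FORMANTS, [1.0, 0.3, 0.1], FS)
```

Because the branches are simply summed, setting all but one amplitude to zero reduces the output to that of a single resonator, which makes the role of each control parameter easy to check in isolation.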
The first formant synthesizers to be dynamically controlled were Walter Lawrence's Parametric Artificial Talker ("PAT") and Gunnar Fant's Orator Verbis Electris ("OVE I") (Lawrence, 1953; Fant, 1953). PAT consisted of three electronic formant resonators connected in parallel, whose inputs were either a buzz or noise. A moving glass slide was used to convert painted patterns into six time functions to control the three formant frequencies, voicing amplitude, fundamental frequency (f0), and noise amplitude. OVE I, on the other hand, consisted of formant resonators connected in series, the lowest two of which were varied in frequency by movements in two dimensions of a mechanical arm. The amplitude and f0 of the voicing source were determined by hand-held potentiometers. OVE I was restricted to the production of vowel-like sounds. PAT and OVE I engaged in an amusing conversation at a conference at MIT in 1956 (examples 3 and 4 of the Appendix).
Improvements were made in the synthesizers and control strategies over the next few years, so that when PAT and OVE met again on the stage at the 1962 Stockholm Speech Communication Conference, both were capable of a remarkably close approximation to a human sentence (examples 5 and 6 of the Appendix). PAT was first modified to have individual formant amplitude controls and a separate circuit for fricatives; it was later converted to cascade operation (Anthony and Lawrence, 1962). OVE I had evolved into OVE II (Fant and Martony, 1962), which included a separate static branch to simulate nasal murmurs and a special cascade of two formants and one antiformant to simulate a simplified approximation to the vocal tract transfer