phonetic segments are added by positioning commands to simulate f0 rises for voiceless consonants and high vowels.

Next, a phonetic synthesis-by-rule system derives time functions that characterize the activity of voicing and noise sound sources, and the acoustic resonance properties of the vocal tract. In the Klattalk program, 19 time functions are generated, although only the three lowest formant frequency time functions are shown in Fig. 3. Rules contained in this phonetic realization module begin by selecting targets for each parameter for each phonetic segment. The target is actually a time-varying trajectory in the case of vowels because most English vowels are either diphthongs (consisting of a sequence of two articulatory targets) or include diphthongized offsets. Targets are sometimes modified by rules that take into account features of neighboring segments. Then, transitions between targets are computed according to rules that range in complexity from simple smoothing to a fairly complicated implementation of the locus theory (Delattre et al., 1955; Klatt, 1979b). Most smoothing interactions involve segments adjacent to one another, but there may also be articulatory/acoustic interaction effects that span more than the adjacent segment -- for example, the Klattalk program includes slow modifications to formant motions to mimic aspects of vowel-to-vowel coarticulation across a short intervening consonant (Öhman, 1966).
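
To make the target-and-transition scheme concrete, the following Python sketch generates formant tracks for a short segment string by holding per-segment targets and linearly interpolating across each boundary. The segment inventory, target values, frame rate, and fixed transition window are illustrative assumptions, not Klattalk's actual rule tables; the real rules also vary transition shape per feature pair, implement locus theory, and handle longer-range coarticulation.

```python
# Minimal sketch of the target-and-transition idea behind a phonetic
# realization module. All numeric values here are assumed for
# illustration only.
import numpy as np

FRAME_MS = 5  # parameter update interval (assumed)

# Hypothetical steady-state targets for the three lowest formants
# (F1, F2, F3 in Hz); real rule tables cover all 19 parameters.
TARGETS = {
    "AA": (700.0, 1220.0, 2600.0),   # vowel /a/
    "IY": (310.0, 2020.0, 2960.0),   # vowel /i/
    "B":  (200.0,  900.0, 2100.0),   # labial stop (nominal values)
}

def realize(segments, trans_ms=40):
    """segments: list of (phone, duration_ms) pairs.
    Returns an (n_frames, 3) array of formant frequencies built by
    holding each segment's target and linearly interpolating across
    a fixed window at each segment boundary."""
    n_frames = [dur // FRAME_MS for _, dur in segments]
    track = np.vstack([np.tile(TARGETS[p], (n, 1))
                       for (p, _), n in zip(segments, n_frames)])
    half = trans_ms // FRAME_MS // 2
    edge = 0
    for n in n_frames[:-1]:
        edge += n
        lo, hi = edge - half, edge + half
        w = np.linspace(0.0, 1.0, hi - lo)[:, None]
        track[lo:hi] = (1 - w) * track[lo - 1] + w * track[hi]
    return track

# e.g., a rough approximation to /b aa b iy/
formants = realize([("B", 60), ("AA", 180), ("B", 60), ("IY", 160)])
```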

Finally, a formant synthesizer (Klatt, 1980) is used to convert this parametric representation into a speech waveform. The nature of the output speech waveform is illustrated by the broadband sound spectrogram at the bottom of Fig. 3. Klattalk might have followed Fig. 2 more closely by creating a model of the articulators and a second model of the conversion of articulatory configuration to sound, but at the current state of knowledge this was judged to be too difficult and computationally costly. Examples of attempts by others to follow an articulatory approach will be described in Sec. I C 2.
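
The core of such a formant synthesizer is a cascade of second-order digital resonators. The sketch below follows the resonator difference equation given in Klatt (1980), y[n] = A·x[n] + B·y[n-1] + C·y[n-2]; the static vowel targets, bandwidths, and crude impulse-train voicing source are simplifying assumptions, since a real synthesizer shapes the glottal pulse and updates every parameter every few milliseconds.

```python
# Digital resonator per Klatt (1980); source and vowel values assumed.
import numpy as np

FS = 10000  # sample rate in Hz (10 kHz, as in Klatt, 1980)

def resonator(x, f, bw):
    """Second-order resonator with center frequency f (Hz) and
    bandwidth bw (Hz):  y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    T = 1.0 / FS
    C = -np.exp(-2 * np.pi * bw * T)
    B = 2 * np.exp(-np.pi * bw * T) * np.cos(2 * np.pi * f * T)
    A = 1.0 - B - C  # unity gain at 0 Hz
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = (A * x[n]
                + (B * y[n - 1] if n >= 1 else 0.0)
                + (C * y[n - 2] if n >= 2 else 0.0))
    return y

# Voicing source: impulse train at f0 = 100 Hz (a crude stand-in
# for a proper glottal pulse model).
dur, f0 = 0.5, 100
source = np.zeros(int(FS * dur))
source[::FS // f0] = 1.0

# Cascade the three lowest formants of a static /a/-like vowel
# (frequencies and bandwidths in Hz are illustrative).
out = source
for f, bw in [(700, 60), (1220, 70), (2600, 110)]:
    out = resonator(out, f, bw)
out /= np.abs(out).max()  # normalize to +/-1 for playback
```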

The following sections consider the various components of the synthesis-by-rule process in detail. A summary highlighting selected previous work on speech synthesis by rule is presented in block diagram form in Fig. 4. The diagram traces early work on the development of speech synthesizers, rule programs, and laboratory text-to-speech systems [many of the earlier references have been reprinted in Flanagan and Rabiner (1973)]. Several commercial text-to-speech systems are identified at the bottom of the figure (Kurzweil, 1976; Gagnon, 1978; Groner et al., 1982; Bruckert et al., 1983; Magnusson et al., 1984), and their historical origins are suggested by the interconnecting references shown above. Other less expensive text-to-speech systems have been described elsewhere (e.g., Bell, 1983; Kaplan and Lerner, 1985).

A. Early synthesizers: Copying speech

Interest and activity in speech synthesis by mechanical and electrical devices go back a long way (Dudley and Tarnoczy, 1950); the history is well summarized by Flanagan (1972, 1976, 1981). The earliest (static) electrical formant synthesizer appears to have been built by Stewart (1922). In this device, a buzzer excited two resonant circuits, permitting approximations to static vowel sounds when the resonance frequencies were adjusted to match the two lowest natural acoustic resonances (formants) of the vocal tract for each vowel.

Speech analysis/synthesis systems were conceived at the Bell Telephone Laboratories in the mid-thirties, culminating in the vocoder (Dudley, 1939), a device for analyzing speech into slowly varying acoustic parameters that could then drive a synthesizer to reconstruct an approximation to the original waveform. This led to the idea of a humanly controlled version of the speech synthesizer, called the "Voder" (Dudley et al., 1939). The Voder, shown in Fig. 5, consisted of keys for selecting a voicing source or noise source, with a foot pedal to control the fundamental frequency of voicing vibrations. The source signal was routed through ten bandpass electronic filters whose output levels were controlled by the operator's fingers. The Voder was displayed at the 1939 World's Fair in New York (example 1 of the Appendix). It took considerable skill and practice to play a sentence on the device. Intelligibility was marginal, but the potential was clearly demonstrated. However, no modern text-to-speech system uses a set of fixed filter channels to create speech.
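
As a rough illustration of the Voder's signal path, the sketch below routes a buzz source through a bank of ten bandpass filters with hand-set output gains. The band edges, filter order, and static gain vector are assumptions for illustration; the actual device let the operator vary the channel gains, the buzz/hiss selection, and the pitch pedal continuously while playing.

```python
# Voder-style channel synthesizer: one source, ten bandpass channels.
# Band edges and gains are assumed; a static gain vector stands in
# for the operator's continuously moving fingers.
import numpy as np
from scipy.signal import butter, lfilter

FS = 10000
dur, f0 = 0.5, 120
t = np.arange(int(FS * dur)) / FS

# Buzzer-like periodic source (square wave at the pedal-set pitch).
buzz = np.where((t * f0) % 1.0 < 0.5, 1.0, -1.0)

# Ten contiguous bands spanning roughly the speech range (assumed).
edges = np.linspace(200, 3200, 11)
gains = np.array([0.9, 0.7, 0.2, 0.1, 0.5, 0.8, 0.3, 0.1, 0.05, 0.02])

out = np.zeros_like(buzz)
for (lo, hi), g in zip(zip(edges[:-1], edges[1:]), gains):
    b, a = butter(2, [lo, hi], btype="bandpass", fs=FS)
    out += g * lfilter(b, a, buzz)
out /= np.abs(out).max()  # normalize for playback
```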

Not long thereafter, the "Pattern Playback" synthesizer was developed at the Haskins Laboratories, which permitted converting the patterns seen on broadband sound spectrograms back into sound (Cooper et al., 1951; see also Young, 1948). In the Pattern Playback synthesizer shown in Fig. 6, a tone wheel generated harmonics of a 120-Hz tone, while harmonic amplitudes were controlled over time by the reflectance of painted spectrographic patterns on a moving transparent belt. Franklin Cooper, Alvin Liberman, Pierre Delattre, and their associates experimented with syllable patterns -- at first copied directly from spectrograms and then simplified and stylized -- in an effort to determine the acoustic cues sufficient to induce the perception of various phonetic contrasts (example 2 of the Appendix). The constant pitch made for a somewhat unnatural sound, but intelligibility was more than adequate for their purposes. In fact, words in 20 Harvard sentences were 95% intelligible if spectrograms
 
