 KLATT 1987, p. 756 
intelligibility of nasals and fricatives, at a relatively small cost in naturalness.

A few years after the 1976 transfer of code to Telesensory Systems, Klatt used the Hunnicutt (1980) letter-to-phoneme rules in the design of a complete text-to-speech system, known as Klattalk (Klatt, 1981, 1982a). The system included a 6000-word exceptions dictionary for common words that failed letter-to-sound conversion, and a crude parser. Klattalk software was then licensed to Digital Equipment Corporation in 1982. Digital announced the DECtalk commercial text-to-speech system in 1983 (Bruckert et al., 1983). In designing DECtalk hardware, Digital engineers included sufficient power and flexibility to be able to plug in improved versions of the Klattalk code as they became available in succeeding years (Conroy et al., 1986) (example 33 of the Appendix).

The most recent version of the Klattalk program includes rules to implement such phonetic details as schwa offglides for lax vowels, nasalization of vowels (the splitting of F1 into a pole-zero-pole complex) adjacent to nasal consonants, postvocalic allophones for sonorant consonants, variations in voice onset time as a function of syllable structure and stress, target undershoot for short segments (Lindblom, 1963), vowel-vowel coarticulation across an intervening consonant (Öhman, 1966), and breathy offsets to utterances.
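The target-undershoot rule cited above (Lindblom, 1963) can be sketched as an exponential approach toward a formant target: a short segment leaves less time for the articulators to move, so the target is undershot. The time constant and frequency values below are illustrative assumptions, not Klattalk's actual rule.

```python
import math

def realized_formant(start_hz, target_hz, duration_ms, tau_ms=40.0):
    """Exponential approach to a formant target. Shorter segments allow
    less movement, so the target is undershot. tau_ms is a hypothetical
    articulatory time constant."""
    fraction_reached = 1.0 - math.exp(-duration_ms / tau_ms)
    return start_hz + (target_hz - start_hz) * fraction_reached

# A short vowel (60 ms) undershoots a 2300-Hz F2 target more than a
# long one (200 ms), both starting from a 1500-Hz locus.
short = realized_formant(1500, 2300, 60)
long_ = realized_formant(1500, 2300, 200)
```

The duration-dependent fraction is the essential point: the same rule, applied to the same targets, yields more centralized formant values for shorter segments.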

Several different voices are provided in Klattalk to approximate the speaking characteristics of men, women, and children. Detailed formant data are stored for only two voices, a man's and a woman's; other male and female voices are created by scaling formant frequencies for different vocal tract sizes and by adjusting an extensive set of synthesis parameters concerned with the voicing source. However, in spite of an ability to modify average f0, f0 range, spectral tilt, glottal open quotient, and breathiness, a truly feminine voice quality remains elusive (example 35 of the Appendix). The DECtalk implementation of Klattalk permits the user to modify characteristics of eight preset voices (Conroy et al., 1986).
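The voice-derivation scheme described above can be sketched as a scaling of stored formant targets: formant frequencies vary inversely with vocal tract length, so a shorter tract raises all formants. The function name, scale factors, and frequency values below are illustrative assumptions, not DECtalk's stored data.

```python
def scale_voice(formants_hz, tract_scale, f0_scale=1.0, f0_hz=120.0):
    """Derive a new voice from stored formant targets. tract_scale is
    the vocal tract length relative to the reference voice; formants
    scale inversely with tract length. f0 is scaled independently."""
    return {
        "formants_hz": [f / tract_scale for f in formants_hz],
        "f0_hz": f0_hz * f0_scale,
    }

# A child-like voice from adult-male vowel targets: roughly 0.8x the
# tract length and a much higher f0 (values are illustrative).
male_targets = [500, 1500, 2500]
child = scale_voice(male_targets, tract_scale=0.8, f0_scale=2.2)
```

A uniform scale factor is only a first approximation; as the text notes, adjusting the voicing-source parameters as well still does not yield a convincingly feminine quality.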

Apparently oblivious to all of the prior research detailed earlier, a man experimenting in his basement workshop, Richard Gagnon, designed a synthesis-by-rule program that eventually resulted in the Votrax SC-01 chip (Gagnon, 1978; Bassak, 1980). The chip has been interfaced with the Elovitz et al. (1976) text-to-phoneme rules (Morris, 1979) and used in several inexpensive text-to-speech products (Sherwood, 1979), the best known of which is the Votrax Type-n-Talk. It is a remarkable device for the price. The chip includes both a cascade formant synthesizer and simple lowpass smoothing circuits for generating continuous time functions to control the synthesizer from a step-function representation derived from target values stored in tables for each phoneme of a somewhat nonstandard phonetic inventory. The latest version of the chip, the SC-1A, is used in the Votrax Personal Speech System (example 28 of the Appendix). The new chip is said to have improved intelligibility over the SC-01, but the intelligibility is not nearly as good as that obtained with the other systems, and its sentence-level rules for prosody and phonetic recoding are not as extensive (see the performance evaluation section below).
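The step-to-continuous conversion described above can be sketched in discrete time as a one-pole lowpass filter applied to a per-sample step function of parameter targets; the smoothing coefficient and frequency values below are illustrative assumptions, not measurements of the SC-01's RC circuits.

```python
def smooth_targets(step_values, alpha=0.3):
    """One-pole lowpass filter over a step function of parameter
    targets, approximating the chip's analog smoothing circuits that
    turn abrupt per-phoneme table values into continuous control
    trajectories. alpha (0 < alpha <= 1) is a hypothetical smoothing
    coefficient; smaller values give slower transitions."""
    smoothed, y = [], float(step_values[0])
    for x in step_values:
        y += alpha * (x - y)  # move a fraction of the way to the target
        smoothed.append(y)
    return smoothed

# F1 target steps from 500 Hz to 700 Hz at a phoneme boundary; the
# smoothed track glides toward 700 Hz instead of jumping.
track = smooth_targets([500] * 5 + [700] * 20, alpha=0.3)
```

One design consequence, noted in the literature on such table-driven schemes, is that a single smoothing time constant per parameter cannot reproduce the context-dependent transition shapes that rule-based formant synthesizers compute explicitly.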

Another type of chip, the Texas Instruments' TMS-5220 linear prediction synthesizer, forms the basis for a second inexpensive product, the Echo text-to-speech system (example 29 of the Appendix). This system appears to use concatenated diphones obtained by excising chunks from natural speech (Peterson et al., 1958; Dixon and Maxey, 1968; Olive, 1977); see below.

A noteworthy early commercial system, the Kurzweil reading machine for the blind, was announced as a product in 1976 (Kurzweil, 1976). It is reputed to have an excellent multifont text reading capability. While admirable in its aspirations, the speech produced by the first versions of this device, which employed phonemic synthesis strategies based on Votrax, was only marginally intelligible (example 27 of the Appendix). Kurzweil currently uses the Prose-2000 as the synthesis hardware in its reading machines.

2. Articulation-based rule programs

A synthesis-by-rule program that manipulates parameters such as formant frequencies according to heuristic rules is not a very close model of the way that people speak. In the hope that a more realistic articulatory model might lead to simpler, more elegant rules, several research groups have attempted to devise simplified models of the articulators or models of the observed shape of the vocal tract. The first such model (Kelly and Lochbaum, 1962) used stored tables of area functions (cross-sectional area of the vocal tract from larynx to lips) for each phonetic segment and a linear interpolation scheme. The authors had begun to assemble a list of special-case exceptions needed to make this type of strategy work better, such as not constraining the vocal tract except at the lip section when synthesizing a labial stop, and including separate shapes for velars before front and back vowels. Still, the intelligibility of the synthesis was said to be not nearly as good as Kelly and Gerstman had obtained with a formant-based rule program (unfortunately, I have been unable to locate a recording of this system).
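The table-plus-interpolation strategy described above can be sketched directly: each phonetic segment stores one area function, and the synthesizer interpolates section by section between consecutive shapes. The eight-section tubes and area values below are illustrative assumptions, not the Kelly and Lochbaum tables.

```python
def interpolate_area(area_a, area_b, t):
    """Linear interpolation between two stored vocal-tract area
    functions (cross-sectional areas in cm^2, larynx to lips), as in a
    table-driven articulatory scheme. t runs from 0 (shape A) to 1
    (shape B); intermediate t gives the transitional tract shape."""
    assert len(area_a) == len(area_b), "tubes must have equal sections"
    return [(1.0 - t) * a + t * b for a, b in zip(area_a, area_b)]

# Hypothetical 8-section area functions for an /a/-like and an
# /i/-like vowel shape (values are illustrative only).
shape_aa = [2.6, 1.6, 1.0, 0.65, 1.6, 2.6, 4.0, 5.0]
shape_iy = [4.0, 3.2, 1.6, 0.65, 0.2, 0.2, 0.65, 1.6]
midpoint = interpolate_area(shape_aa, shape_iy, 0.5)
```

Interpolating areas section by section is exactly the weakness the special-case exceptions address: a straight-line path between two valid tract shapes can pass through configurations (e.g., a prematurely released labial closure) that no real articulation would produce.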

Based on the success of Stevens and House (1955) in developing a three-parameter description of vocal tract shapes capable of describing English vowels, the next more ambitious articulatory models abandoned direct specification of an area function in favor of an intermediate model possessing a small set of movable structures corresponding to the tongue, jaw, lips, velum, and larynx. Rules for converting phonetic representations to signals for the control of the position of quasi-independent articulators in an articulatory synthesizer were then developed in several laboratories (Nakata and Mitsuoka, 1965; Henke, 1967; Coker, 1968; Werner and Haggard, 1969; Mermelstein, 1973). The Coker rules were demonstrated at the 1967 M.I.T. Conference on Speech Communication and Processing (example 19 of the Appendix).

Coker found the system to be challenging to work with. For example, in his model shown in Fig. 22, the tongue body position was relative to jaw opening, and the location of the tongue tip was relative to the computed coordinates of the tongue body. If the objective were to make a narrow constriction for, e.g., /s/, several semi-independent articulators
Smithsonian Speech Synthesis History Project
National Museum of American History | Archives Center
Smithsonian Institution