
 KLATT 1987, p. 757 

(jaw, tongue body relative to jaw, and tongue tip relative to tongue body) all had to be sent to appropriate targets at times that took into account their relative masses and available muscular forces (Coker, 1976). Modern three-dimensional models of the articulators now solve this particular problem of control precision and coordination by grooving the tongue at the midline before forcing it up against the roof of the mouth (Fujimura and Kakita, 1979). However, a general solution to the problem of seeking target articulatory shapes via sets of dependent articulators seems to require control strategies incorporating considerable knowledge of the dynamic constraints on the system and selection of an optimal control strategy from a multiplicity of alternative ways to achieve a desired goal.

Several novel articulation-based synthesis-by-rule programs were developed at this time. Nakata and Mitsuoka (1965) attempted to implement the idea that an intervocalic consonant is a gesture superimposed on an underlying vowel-vowel transition. Henke (1967) proposed an articulatory strategy in which articulators not constrained by the present segmental configurational goals are free to look ahead and begin to seek articulatory goals of upcoming segments. In this way, anticipatory lip rounding and other segmental interactions might be explained on general principles. There is currently considerable disagreement as to the extent to which articulators are free to participate in such lookahead strategies, and as to the number of segments over which lookahead is possible. Finally, Hiki (1970) simulated the muscular control of the articulators in order to be able to specify articulation in terms of motor control signals. This would be a very attractive model if it were the case that the motor commands for a segment were invariant with phonetic context, but unfortunately, electromyographic data indicate that this is far from the case (MacNeilage and DeClerk, 1969).
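Henke's lookahead strategy can be sketched in a few lines: for each articulator, a segment that leaves it unconstrained lets it begin moving toward the next segment that does constrain it (anticipation), while after the last specification the old goal simply carries over. The feature names and numeric target values below are invented for illustration and are not taken from Henke's system.

```python
UNSPECIFIED = None

# Each segment specifies targets only for the articulators it constrains.
# Hypothetical values for the sequence /s t r u/: only the final rounded
# vowel constrains lip rounding.
segments = [
    {"phone": "s", "lip_rounding": UNSPECIFIED, "tongue_tip": 1.0},
    {"phone": "t", "lip_rounding": UNSPECIFIED, "tongue_tip": 1.0},
    {"phone": "r", "lip_rounding": UNSPECIFIED, "tongue_tip": 0.5},
    {"phone": "u", "lip_rounding": 1.0,         "tongue_tip": 0.0},
]

def lookahead_targets(segments, articulator):
    """Return, for each segment, the target the articulator should seek:
    its own specification if present, otherwise the next specified target
    (anticipation), otherwise the most recent one (carryover)."""
    n = len(segments)
    targets = [UNSPECIFIED] * n
    # Backward pass: an unconstrained articulator looks ahead to the
    # next segment's goal.
    pending = UNSPECIFIED
    for i in range(n - 1, -1, -1):
        spec = segments[i][articulator]
        if spec is not UNSPECIFIED:
            pending = spec
        targets[i] = pending
    # Forward pass: after the last specification, carry the goal over.
    last = UNSPECIFIED
    for i in range(n):
        if targets[i] is UNSPECIFIED:
            targets[i] = last
        else:
            last = targets[i]
    return targets

# Lip rounding is anticipated throughout the cluster: the lips start
# rounding during /s/, well before the vowel /u/.
rounding = lookahead_targets(segments, "lip_rounding")  # [1.0, 1.0, 1.0, 1.0]
```

This reproduces anticipatory lip rounding as an automatic consequence of the general principle, which is exactly the appeal Henke claimed for the strategy; the open empirical questions mentioned above (which articulators may participate, and how far ahead) would surface here as limits on the backward pass.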

An entire text-to-speech system for English based on an articulatory model was created in Japan (Teranishi and Umeda, 1968; Matsui et al., 1968) (example 24 of the Appendix). The text analysis and pause assignment rules of this system were based on a sophisticated parser (Umeda and Teranishi, 1975). Using a dictionary of 1500 common words found useful for parsing, the program checked each sentence for length; if it was greater than about ten syllables, it was subdivided into smaller "breath groups" separated by pauses. Some of these rules were later modified slightly and combined with the Coker articulatory rules to produce a text-to-speech system at Bell Laboratories (Coker et al., 1973; Umeda, 1976). The Bell Labs system was notable for its attention to detail in the specification of segmental durations and allophonic variation (example 25 of the Appendix).
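The length check described above — subdividing sentences of more than about ten syllables into breath groups separated by pauses — might be sketched as follows. The crude vowel-group syllable counter and the hand-marked boundary set are simplified placeholders, not the Umeda–Teranishi parser or its 1500-word dictionary.

```python
import re

def count_syllables(word):
    """Crude vowel-group count as a stand-in for real syllabification."""
    return max(1, len(re.findall("[aeiouy]+", word.lower())))

def breath_groups(words, boundary_after, max_syllables=10):
    """Split a word list into breath groups of roughly max_syllables,
    cutting only at marked phrase boundaries (word indices after which
    a pause is permissible)."""
    groups, current, count = [], [], 0
    for i, word in enumerate(words):
        current.append(word)
        count += count_syllables(word)
        if count >= max_syllables and i in boundary_after:
            groups.append(current)
            current, count = [], 0
    if current:
        groups.append(current)
    return groups

words = ("the quick brown fox jumps over the lazy dog "
         "and then runs away into the forest").split()
# Hypothetical parser output: pauses allowed after "dog" and "away".
groups = breath_groups(words, boundary_after={8, 12})
# -> two breath groups, the first ending at "dog"
```

The design choice mirrors the description in the text: length alone triggers subdivision, but the cut is placed at a syntactically motivated boundary supplied by the parser rather than at an arbitrary word.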

While it is possible to generate fairly natural-sounding speech using a modern articulatory synthesizer (Flanagan et al., 1975; Flanagan and Ishizaka, 1976, 1978), rule-based articulatory synthesis programs have been difficult to optimize. This seems to be due in part to the unavailability of sufficient data on the motions of the articulators during speech production. Even so, the strategies developed to control such a synthesizer may reveal interesting aspects of articulatory control during the production of natural speech (Mermelstein, 1973; Coker, 1976).

3. Rule compilers

Carlson and Granström (1975, 1976) developed a special programming language to permit linguists to formulate synthesis rules in a natural way, similar to the Chomsky and Halle (1968) formalism. An important advantage of the language is an ability to refer to natural sets of phonemes through a distinctive feature notation, making rule statement simple, efficient, and easy to read. These rules are then compiled automatically into a synthesis-by-rule program. A number of languages (Swedish, Norwegian, American English, British English, Spanish, French, German, and Italian) have been synthesized using this system (Carlson and Granström, 1976; Carlson et al., 1982a), and the resulting system has been brought out as a product, the Infovox SA-101 (example 31 of the Appendix). A similar approach has been developed by Hertz (1982), who has used her programming facility to synthesize English and Japanese.
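The advantage of feature notation can be made concrete with a toy rule applier: a single rule stated over a feature bundle covers a whole natural class of phonemes, rather than listing each phoneme separately. The feature inventory and the rule below are invented for illustration, in the general spirit of Chomsky–Halle rewrite rules, and are not taken from the Carlson–Granström language.

```python
# Minimal feature bundles for a few phones (hypothetical values).
FEATURES = {
    "p": {"voice": False, "stop": True},
    "t": {"voice": False, "stop": True},
    "b": {"voice": True,  "stop": True},
    "a": {"voice": True,  "stop": False},
}

def matches(phone, feature_spec):
    """A phone matches a spec if it agrees on every listed feature,
    so one rule statement covers an entire natural class."""
    return all(FEATURES[phone][f] == v for f, v in feature_spec.items())

def apply_rule(phones, target_spec, change):
    """Apply a context-free rewrite: every phone matching target_spec
    is transformed by `change`."""
    return [change(p) if matches(p, target_spec) else p for p in phones]

# Toy rule: [-voice, +stop] -> aspirated (marked here by appending "h").
# One statement handles both /p/ and /t/.
out = apply_rule(list("pat"), {"voice": False, "stop": True},
                 lambda p: p + "h")  # ["ph", "a", "th"]
```

A rule compiler of the kind described above would translate many such declarative statements — including contextual conditions omitted here — into an efficient synthesis-by-rule program, keeping the linguist's source notation readable.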

Hertz et al. (1985) believe that powerful new rule compilers are needed in text-to-speech systems in order to take advantage of recently proposed linguistic structures such as "three-dimensional" phonology (Halle and Vergnaud, 1980; Clements, 1985). Programmers of synthesis-by-rule systems have always faced the problem that the abstract representation for a sentence is not simply a linear string of symbols. Some rules want to manipulate phonetic segments (while ignoring stress and syntactic symbols), while other rules have a domain that is closer to syllables (or syllable onsets and rhymes), and other rules deal with whole words and phrases. One solution has been to order rules so that it is possible to erase syntactic structure after all syntactic rules

Smithsonian Speech Synthesis History Project
National Museum of American History | Archives Center