NMAH | Smithsonian Speech Synthesis History Project (im

phones and acoustic events; they do not attempt to define a set of acoustic features. The simple system of Rao and Thosar is exceptional in that (like Ingemann's set of Playback rules) it tries to make the most of the regularities which do exist; but the idiosyncratic characteristics of each phone must also be specified by these investigators.

The manner in which these systems translate from the discrete phonetic level to the continuous acoustic level also proves somewhat unsatisfying. The notions 'target' and 'transition' imply that the former characterizes essential aspects of a phone and the latter is a means of connecting one phone smoothly with another. In fact, as is well known, much of the information at the acoustic level in speech is encoded in the formant transitions, and most of the ingenuity devoted to acoustic rules has had the purpose of providing appropriate transitions for the various form and manner classes. This circumstance does not invalidate the notions 'target' and 'transition' for synthesis by rule in general; it is merely a further indication of the inadequacy of acoustic synthesis by rule.

4.2 Vocal-tract Shape Systems

It appears, then, that there are limitations on the adequacy of a synthesis-by-rule system operating only with the acoustic stage. A number of systems have therefore been developed which incorporate earlier stages in the speech chain.

The next earlier stage in the speech chain is vocal tract shape, which, for a given source of excitation, determines the spectrum of the acoustic output (Fant 1960). Since the acoustics of speech production is complex, it seems plausible that rules for synthesis could be more readily and simply stated in terms of dynamic variations in shape. The speech implied by a sequence of shapes can then be heard with a vocal-tract analog synthesizer.
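The dependence of spectrum on shape can be illustrated with the simplest textbook case, which the text itself does not work through: a uniform, lossless tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of the quarter-wavelength frequency. The sketch below is only an illustration under those idealized assumptions. The function name, the 17.5 cm tract length, and the 35,000 cm/s speed of sound are assumed values chosen for the example, not figures from any of the systems discussed here.

```python
def uniform_tube_resonances(length_cm, n_resonances=3, sound_speed_cm_s=35000.0):
    """Resonant frequencies (Hz) of an idealized uniform tube.

    Assumes a lossless tube of constant cross-section, closed at one end
    (glottis) and open at the other (lips): F_n = (2n - 1) * c / (4 * L).
    All names and default values here are illustrative assumptions.
    """
    return [
        (2 * n - 1) * sound_speed_cm_s / (4.0 * length_cm)
        for n in range(1, n_resonances + 1)
    ]


# Hypothetical neutral tract of 17.5 cm versus a shorter 14 cm tract.
neutral = uniform_tube_resonances(17.5)
shorter = uniform_tube_resonances(14.0)

print("17.5 cm tube:", neutral)
print("14.0 cm tube:", shorter)
```

Changing only the length of the tube shifts every resonance, which is the simplest instance of shape determining spectrum. A real area function varies along the tract, and a vocal-tract analog is needed to compute the resulting output.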

The general strategy used for synthesis by rule with a vocal tract analog, which parallels the strategy used for acoustic systems, has been to specify a target shape for each phone and to interpolate by some rule between targets. In the system of Kelly and Lochbaum (1962), transition times and target shapes, represented as area functions, are stored for each phone. During the transition, the series of area values for each segment of the vocal tract analog (and also values for excitation parameters and nasal coupling) are obtained by linear interpolation between the target values. There are numerous exceptions to this general principle of operation, most of which are attempts to provide for the effects of coarticulation and centralization. Vowels next to nasal consonants are nasalized throughout. Labials do not have a fixed target shape: the lips are constricted or closed for a period, during which the rest of the tract moves from the previous to the following target. An unstressed vowel has zero duration and its target shape is the average of the shape for the corresponding stressed vowel and that for the neutral or
