phones and acoustic events; they do not attempt to define a set of
acoustic features. The simple system of Rao and Thosar is exceptional
in that (like Ingemann's set of Playback rules) it tries to make the
most of the regularities which do exist; but even these
investigators must also specify the idiosyncratic characteristics
of each phone.
The manner in which these systems translate from the discrete
phonetic level to the continuous acoustic level also proves
somewhat unsatisfying. The notions 'target' and 'transition'
imply that the former characterizes essential aspects of a phone
and the latter is a means of connecting one phone smoothly with
another. In fact, as is well known, much of the information at the
acoustic level in speech is encoded in the formant transitions, and
most of the ingenuity devoted to acoustic rules has had the purpose
of providing appropriate transitions for the various place and manner
classes. This circumstance does not invalidate the notions 'target'
and 'transition' for synthesis by rule in general; it is merely a
further indication of the inadequacy of acoustic synthesis
by rule.

4.2 Vocal-tract Shape Systems

It appears, then, that there are limitations on the adequacy of a
synthesis-by-rule system operating only at the acoustic stage. A
number of systems have therefore been developed which incorporate
earlier stages in the speech chain.
The next earlier stage in the speech chain is vocal tract shape,
which, for a given source of excitation, determines the spectrum of
the acoustic output (Fant 1960). Since the acoustics of speech
production is complex, it seems plausible that rules for synthesis
could be more readily and simply stated in terms of dynamic variations
in shape. The speech implied by a sequence of shapes can then be
heard with a vocal-tract analog synthesizer.
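
As a rough illustration of how shape fixes the spectrum, the simplest
textbook case is a uniform lossless tube, closed at the glottis and open
at the lips, whose resonances fall at odd multiples of c/4L. The sketch
below is not drawn from any of the systems discussed here; the function
name, the 17.5 cm tract length, and the 350 m/s speed of sound are
assumptions chosen only for the example.

```python
def uniform_tube_formants(length_cm, n_formants=3, speed_of_sound_cm_s=35000.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end
    (glottis) and open at the other (lips): F_n = (2n - 1) * c / (4 * L).

    Assumption: a lossless tube of constant cross-section; a real vocal
    tract is neither, so these values are only a first approximation.
    """
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    return [(2 * n - 1) * speed_of_sound_cm_s / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]


# Assumed tract length of 17.5 cm, giving resonances near 500, 1500, 2500 Hz.
formants = uniform_tube_formants(17.5)
```

Any departure from uniform cross-section shifts these resonances, which
is why a rule stated in terms of shape implies a particular spectrum.
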
The general strategy used for synthesis by rule with a vocal tract
analog, which parallels the strategy used for acoustic systems, has
been to specify a target shape for each phone and to interpolate by
some rule between targets. In the system of Kelly and Lochbaum (1962),
transition times and target shapes, represented as area functions, are
stored for each phone. During the transition, the series of area values
for each segment of the vocal tract analog (and also values for
excitation parameters and nasal coupling) are obtained by linear
interpolation between the target values. There are numerous exceptions
to this general principle of operation, most of which are attempts
to provide for the effects of coarticulation and centralization.
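
The general principle described above, before any exceptions are
applied, can be sketched as follows. This is a minimal illustration and
not a reconstruction of the Kelly and Lochbaum program: the function
name, the frame count, and the four-section area values are hypothetical,
and excitation parameters and nasal coupling (which the text says are
interpolated in the same way) are omitted.

```python
from typing import List


def interpolate_area(prev_target: List[float], next_target: List[float],
                     n_frames: int) -> List[List[float]]:
    """Linearly interpolate each tract-section area between two targets.

    Returns n_frames area functions; the last one equals next_target.
    """
    if len(prev_target) != len(next_target):
        raise ValueError("targets must have the same number of sections")
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    frames = []
    for k in range(1, n_frames + 1):
        w = k / n_frames  # fraction of the transition completed
        frames.append([(1 - w) * a + w * b
                       for a, b in zip(prev_target, next_target)])
    return frames


# Hypothetical four-section area functions (cm^2), for illustration only.
phone_a = [1.0, 2.0, 4.0, 3.0]
phone_b = [3.0, 2.0, 1.0, 5.0]
frames = interpolate_area(phone_a, phone_b, n_frames=4)
```

Here the stored transition time is represented by the number of frames,
and each section of the tract moves independently along a straight line
from its previous target value to its next one.
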
Vowels next to nasal consonants are nasalized throughout. Labials
do not have a fixed target shape: the lips are constricted or closed
for a period, during which the rest of the tract moves from the
previous to the following target. An unstressed vowel has zero
duration and its target shape is the average of the shape for the
corresponding stressed vowel and that for the neutral
or