4.1 Acoustic-level Systems
We have already mentioned some systems in which the phonetic level
is mapped directly on to the acoustic level, including one, that of
Kelly and Gerstman (1961), which, like other more recent systems of
this kind, is parametric. In these systems a target spectrum for
each phone⁹ is specified by a set of stored parameter values. Given
a phonetic transcription of an utterance, the synthesis program
calculates the momentary changes of value for each parameter from
target to target as a function of time. (Notice that this is an
extremely natural way to treat the problem of translating from the
discrete to the continuous domain.)
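As a rough illustration of the target-to-target scheme just described, the following Python sketch stores formant targets for each phone and moves each parameter from one target to the next as a function of time. The phone labels, target values, frame step, fixed segment duration and the use of straight-line interpolation are all assumptions made for illustration; none of them is taken from the systems discussed here.

```python
# Hypothetical target table: phone -> (F1, F2, F3) in Hz. Values are illustrative only.
TARGETS = {
    "b": (200.0, 900.0, 2300.0),
    "a": (700.0, 1200.0, 2500.0),
    "t": (300.0, 1700.0, 2600.0),
}


def parameter_track(phones, seg_ms=100, step_ms=10):
    """Return a list of (time_ms, (F1, F2, F3)) frames moving from target to target.

    Assumes each target is reached at the start of its segment and that
    values change linearly until the next target (an illustrative choice).
    """
    frames = []
    for i in range(len(phones) - 1):
        start = TARGETS[phones[i]]
        end = TARGETS[phones[i + 1]]
        for t in range(0, seg_ms, step_ms):
            frac = t / seg_ms  # progress through the current segment, 0 <= frac < 1
            values = tuple(s + frac * (e - s) for s, e in zip(start, end))
            frames.append((i * seg_ms + t, values))
    # The final frame sits exactly on the last target.
    frames.append(((len(phones) - 1) * seg_ms, TARGETS[phones[-1]]))
    return frames
```

Calling `parameter_track(["b", "a", "t"])` turns the discrete transcription into a continuous sequence of parameter frames, which is the discrete-to-continuous translation referred to above.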
The most important differences among the various systems have to do
with the procedures for this calculation, and in particular, the
procedure for calculating formant motion, since intelligibility
depends crucially on the choice of targets toward which the formants
move, and the timing of their movements. In the Kelly-Gerstman
system, it will be recalled, an initial transition duration, a final
transition duration and a steady-state duration are stored for each
phone. The duration of a transition between two adjacent phones is
the sum of the final transition duration of the first phone and the
initial transition duration of the next. During the steady-state
period, formants remain at their target values; during the transition
period, they move from one set of target values to the next, following
a convex path from consonant to vowel, a concave path from vowel
to consonant, and a linear path otherwise.
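The timing rule described above for the Kelly-Gerstman system can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reconstruction of the original program: the `Phone` record, the "C"/"V" classification, the sampling step and all numeric values are hypothetical, and because the text does not give the exact shape of the convex and concave paths, the curves `x**2` and `1 - (1 - x)**2` are used merely as stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Phone:
    label: str
    kind: str          # "C" (consonant) or "V" (vowel); assumed classification
    target: float      # one formant target in Hz (illustrative value)
    initial_ms: float  # stored initial transition duration
    steady_ms: float   # stored steady-state duration
    final_ms: float    # stored final transition duration


def shape(frac, first, second):
    """Map linear progress (0..1) onto the path type named in the text.

    The exact curves are not specified in the source; x**2 and
    1 - (1 - x)**2 are stand-ins for 'convex' and 'concave'.
    """
    if first.kind == "C" and second.kind == "V":
        return frac ** 2                # convex path (assumed curve)
    if first.kind == "V" and second.kind == "C":
        return 1.0 - (1.0 - frac) ** 2  # concave path (assumed curve)
    return frac                         # linear path otherwise


def formant_track(phones, step_ms=5.0):
    """Return one formant's values sampled every step_ms across the utterance."""
    track = []
    for i, ph in enumerate(phones):
        # Steady state: the formant remains at the phone's target value.
        track.extend([ph.target] * round(ph.steady_ms / step_ms))
        if i + 1 < len(phones):
            nxt = phones[i + 1]
            # Transition duration = final duration of this phone
            # + initial duration of the next phone.
            n_trans = round((ph.final_ms + nxt.initial_ms) / step_ms)
            for k in range(n_trans):
                w = shape(k / n_trans, ph, nxt)
                track.append(ph.target + w * (nxt.target - ph.target))
    return track
```

In this sketch the initial transition duration of the first phone and the final transition duration of the last phone are simply unused; how utterance boundaries are handled is not stated in the text above.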
In the system of Holmes et al. (1964), a 'rank' is stored
for each phone, corresponding to its manner class. Manner classes
having characteristic transitions (e.g. stop consonants) rank high;
manner classes for which the character of the transition is
determined by the adjacent phone rank low. The character of the
transition between adjacent phones is determined by the ranking
phone, that is, the higher-ranked of the two. Each transition is calculated by linear interpolation
between a target value for the first phone and a boundary value,
and between the boundary value and the target for the second phone.
The durations of the two parts of the transition are stored for the
ranking phone. The boundary value is equal to $C_R + W_R(F_A)$, where
$C_R$ is a constant and $W_R$ a weighting factor for this formant stored
for the ranking
__________
9. Workers in synthesis by rule (including the author) have been in
the habit of referring to the units of their input transcriptions
as 'phonemes'. In most cases, these units do not correspond either
to the phonemes of structural linguistics or to the phonological
segments of generative phonology; they tend to be closer to the level
of a broad phonetic transcription. We use the term 'phone' except in
the case of systems where a deliberate distinction is attempted
between phonological and phonetic levels.