NMAH | Smithsonian Speech Synthesis History Project (im

4.1 Acoustic-level Systems

We have already mentioned some systems in which the phonetic level is mapped directly onto the acoustic level, including that of Kelly and Gerstman (1961), which, like more recent systems of this kind, is parametric. In these systems a target spectrum for each phone⁹ is specified by a set of stored parameter values. Given a phonetic transcription of an utterance, the synthesis program calculates, as a function of time, the momentary changes in the value of each parameter from target to target. (Notice that this is an extremely natural way to treat the problem of translating from the discrete to the continuous domain.)

The most important differences among the various systems have to do with the procedures for this calculation, and in particular, the procedure for calculating formant motion, since intelligibility depends crucially on the choice of targets toward which the formants move, and the timing of their movements. In the Kelly-Gerstman system, it will be recalled, an initial transition duration, a final transition duration and a steady-state duration are stored for each phone. The duration of a transition between two adjacent phones is the sum of the final transition duration of the first phone and the initial transition duration of the next. During the steady-state period, formants remain at their target values; during the transition period, they move from one set of target values to the next, following a convex path from consonant to vowel, a concave path from vowel to consonant, and a linear path otherwise.
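The Kelly-Gerstman timing rule described above can be illustrated with a minimal Python sketch. It is not a reconstruction of the original program: the `Phone` record, the 10 ms frame step, and the quadratic curves standing in for the "convex" and "concave" paths are all assumptions made for illustration, since the text does not give the exact curve shapes. The sketch also ignores the initial transition of the first phone and the final transition of the last.

```python
from dataclasses import dataclass


@dataclass
class Phone:
    """Hypothetical per-phone record; field names are illustrative only."""
    name: str
    kind: str        # "C" for consonant, "V" for vowel
    targets: tuple   # formant targets in Hz, e.g. (F1, F2, F3)
    initial_ms: int  # stored initial transition duration
    steady_ms: int   # stored steady-state duration
    final_ms: int    # stored final transition duration


def transition_ms(first, second):
    """Transition duration between adjacent phones: the final transition
    duration of the first plus the initial transition duration of the next."""
    return first.final_ms + second.initial_ms


def blend_weight(first, second, s):
    """Map normalized time s in [0, 1] to an interpolation weight.
    The quadratic shapes are assumed stand-ins for the convex and
    concave paths; only the linear case is taken directly from the text."""
    if first.kind == "C" and second.kind == "V":
        return s * s                 # convex in time (assumed shape)
    if first.kind == "V" and second.kind == "C":
        return 1 - (1 - s) ** 2      # concave in time (assumed shape)
    return s                         # linear otherwise


def formant_track(phones, step_ms=10):
    """Return one tuple of formant values per frame."""
    frames = []
    for i, phone in enumerate(phones):
        # Steady state: formants stay at the phone's targets.
        for _ in range(0, phone.steady_ms, step_ms):
            frames.append(phone.targets)
        # Transition: move from this phone's targets toward the next phone's.
        if i + 1 < len(phones):
            nxt = phones[i + 1]
            dur = transition_ms(phone, nxt)
            for t in range(0, dur, step_ms):
                w = blend_weight(phone, nxt, t / dur)
                frames.append(tuple(
                    a + w * (b - a)
                    for a, b in zip(phone.targets, nxt.targets)
                ))
    return frames
```

Given a list of `Phone` records with made-up target and duration values, `formant_track` yields steady-state frames at each phone's targets, separated by transition frames whose length comes from `transition_ms`. Substituting a different `blend_weight` changes only the path shape, not the timing.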

In the system of Holmes et al. (1964), a 'rank' is stored for each phone, corresponding to its manner class. Manner classes having characteristic transitions (e.g. stop consonants) rank high; manner classes for which the character of the transition is determined by the adjacent phone rank low. The character of the transition between two adjacent phones is determined by the higher-ranking of the two, the 'ranking phone'. Each transition is calculated in two parts by linear interpolation: from the target value for the first phone to a boundary value, and from the boundary value to the target for the second phone. The durations of the two parts of the transition are stored for the ranking phone. The boundary value is equal to CR + WR(FA), where CR is a constant and WR a weighting factor for this formant stored for the ranking
__________
9. Workers in synthesis by rule (including the author) have been in the habit of referring to the units of their input transcriptions as 'phonemes'. In most cases, these units do not correspond either to the phonemes of structural linguistics or to the phonological segments of generative phonology; they tend to be closer to the level of a broad phonetic transcription. We use the term phone except in the case of systems where a deliberate distinction is attempted between phonological and phonetic levels.

Smithsonian Speech Synthesis History Project
National Museum of American History \| Archives Center