NMAH | Smithsonian Speech Synthesis History Project (im

phone, while FA is the target value of this formant stored for the adjacent phone. Hence, the character of the transition depends mainly on variables stored for the ranking phone. Thus, each phone has within its boundaries an initial transition, influenced by the previous phone, and a final transition, influenced by the following phone. A duration is stored for each phone; if it is greater than the sum of the durations of the initial and final transitions calculated for the phone, the target values are used for the steady state portion. If the duration is less than the sum, and the paths of the calculated transitions fail to intersect, they are replaced by a linear interpolation between the initial and final boundary values. But if the paths do intersect, the values for each transition between the boundary value and the intersection are used, and the others discarded. Thus the formants of shorter vowels do not attain their targets; their frequencies are context-dependent, as in natural speech (Shearme and Holmes 1962; Lindblom 1963).
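The duration rule just described can be sketched in a few lines of Python. This is only an illustrative reading of the prose, not the program of Holmes et al.: the function name `formant_track` and all parameter names are hypothetical, the boundary values are taken as given inputs, and each calculated transition is assumed to follow a straight-line path (also assumed: both transition durations are positive).

```python
def formant_track(b_init, target, b_final, t_init, t_final, duration, step=1.0):
    """Sample one formant across one phone (times in ms, frequencies in Hz).

    b_init / b_final : boundary values at the start and end of the phone
    t_init / t_final : durations of the initial and final transitions
    All names are hypothetical; straight-line transitions are an assumption.
    """
    n = int(round(duration / step))
    times = [i * step for i in range(n + 1)]
    t0 = duration - t_final              # time at which the final path starts

    def initial_path(t):                 # boundary value -> target
        return b_init + (target - b_init) * t / t_init

    def final_path(t):                   # target -> boundary value
        return target + (b_final - target) * (t - t0) / t_final

    def linear(t):                       # fallback: boundary to boundary
        return b_init + (b_final - b_init) * t / duration

    if duration >= t_init + t_final:
        # Both transitions fit; the target is held in the steady-state portion.
        return [initial_path(t) if t < t_init
                else final_path(t) if t > t0
                else target
                for t in times]

    # The phone is too short: look for a crossing of the two paths
    # inside the stretch where they overlap.
    s1 = (target - b_init) / t_init      # slope of the initial path
    s2 = (b_final - target) / t_final    # slope of the final path
    if s1 == s2:                         # parallel paths never cross
        return [linear(t) for t in times]
    t_x = (target - b_init - s2 * t0) / (s1 - s2)
    lo, hi = max(0.0, t0), min(t_init, duration)
    if not (lo <= t_x <= hi):            # paths fail to intersect
        return [linear(t) for t in times]
    # Keep each path only between its boundary value and the crossing.
    return [initial_path(t) if t <= t_x else final_path(t) for t in times]
```

Under these assumptions, a call such as `formant_track(1000, 2000, 1000, 40, 40, 60, step=10)` describes a short vowel whose two transitions cross before the target is reached, so the sampled formant turns back short of 2000 Hz, which is the undershoot the text attributes to shorter vowels.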

Denes (1970) uses a similar scheme, the boundary values being dependent on the target values and on a weight assigned to each phone. Our own system (Mattingly 1968a, b) also uses a scheme like that of Holmes et al., except that interpolation is done according to a simple non-linear equation which assures that formants curve sharply near boundaries. The formant transitions in Rabiner's (1967) system, the most serious attempt to simulate natural formant motion, are calculated according to a critically damped second degree differential equation. The manner in which a formant moves from its initial position towards the next target depends on a time constant of the equation, which is specified for each formant and each possible pair of adjacent phones. When all formants have arrived within a certain distance of the current target, they start to move toward the following target, unless a delay (permitting closer approximation or attainment of the target) is specified. It is not obvious that schemes for non-linear motion offer any great advantage over linear schemes. While a non-linear rule results in formant movements which are more naturalistic, they do not seem to be necessarily perceptually superior to, or even distinguishable from, linear movements. If the formant moves between appropriate frequencies over an appropriate time-period, the manner of its motion does not seem to be too important.
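To make the contrast with linear schemes concrete, the sketch below evaluates the textbook response of a critically damped second-order system moving from a starting frequency toward a target. It is not Rabiner's published formulation: the function names, the assumption of zero initial rate of change, and the single time constant `tau` are illustrative choices made here.

```python
import math


def critically_damped(start, target, tau, t):
    """Formant value at time t for a critically damped second-order response.

    Solves tau^2 f'' + 2 tau f' + f = target with f(0) = start, f'(0) = 0.
    The zero initial slope is an assumption of this sketch.
    """
    decay = (1.0 + t / tau) * math.exp(-t / tau)
    return target + (start - target) * decay


def arrival_time(start, target, tau, tolerance, step=1.0):
    """First sampled time at which the formant is within `tolerance`
    of the target (hypothetical helper; `tolerance` must be positive)."""
    t = 0.0
    while abs(critically_damped(start, target, tau, t) - target) > tolerance:
        t += step
    return t


if __name__ == "__main__":
    # Illustrative numbers only: 500 Hz toward 1500 Hz, tau = 20 ms.
    for t in (0, 20, 40, 80):
        print(t, round(critically_damped(500.0, 1500.0, 20.0, t), 1))
    print("within 50 Hz after", arrival_time(500.0, 1500.0, 20.0, 50.0), "ms")
```

The `arrival_time` helper mirrors the rule that the formants begin moving toward the following target once all of them are within a certain distance of the current one; in a multi-formant setting the latest arrival time among the formants would set the switch point.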

In Rao and Thosar's (1967) system, each phone is characterized by a set of 'attributes', i.e. features of a sort. A phone is either a vowel or a consonant; vowels are front or back; consonants are stops or fricatives; voiced or unvoiced; labial, dental or palatal. Transition patterns depend on these attributes and on the duration and steady-state spectral values stored for each phone. Vowel-vowel transitions are linear from steady state to steady state, and the two temporal variables -- total transition time and the fraction of the total within the duration of the earlier vowel -- are the same for all pairs of vowels. For consonant-vowel transition, the boundary value for each formant is equal to F(FL) + (1-F)FV, where FV is the target frequency of the vowel, FL is the consonant locus frequency and F is a weighting factor. Transition time and F1 locus depend on the value
