|KLATT 1987, p. 762|
addition, the structure of discourse seems to cause readers to start with a higher fo at the beginning of a paragraph (Lehiste, 1975b).
In addition to the rule-governed changes to fundamental frequency over a sentence, there are also local perturbations due to aspects of segmental articulation. The fo contour is higher near a voiceless consonant than near a voiced consonant, and is higher on a high vowel (House and Fairbanks, 1953; Peterson and Barney, 1952), although this latter effect may be reduced in sentence contexts (Umeda, 1981).
For synthesis by rule, what is needed is a theory that can predict when fo will rise or fall, and what levels it will reach on individual stressed syllables of a sentence as a function of syntactic structure, stress pattern, and semantic/performance variables (if known) such as the location of the most important word in the sentence, or the speaker's attitude toward what is being said. Such theories are still in their infancy, and many alternative formulations exist, but fortunately several are complete enough to serve as models for a text-to-speech algorithm. One type of theory is based on the view that fo moves (sluggishly) from target tone to target tone (Pike, 1945). Another class of theories includes commands to raise and lower fo at certain times, emphasizing the motion over the actual target achieved (Bolinger, 1951; see also Ladd, 1983).
The first algorithm for determination of a fundamental frequency contour was programmed by Mattingly (1966) and incorporated in the phonemic synthesis-by-rule program of Holmes et al. (1964). In the British tradition of Armstrong and Ward (1931), which separates intonation and stress, Mattingly's rules recognized three intonational "tunes" that could be placed on the last prominent syllable of a clause. The tunes, shown at the top in Fig. 25, are "falling," "rising," and "fall-rise" -- corresponding to statement end, question end, and continuation rise. Other prominent syllables of a sentence (typically the stressed syllable in semantically important content words) could be marked by the user, in which case they received a local increase in fo. Unstressed syllables were generally lower in pitch because they were not assigned a target.
These rules were intended to mimic intonation patterns of British English; an American version was published later by Mattingly (1968). In this rule system, the tendency for fo to start high and fall gradually throughout a sentence (declination) was reduced for American English, and the prominent/nonprominent opposition was elaborated by distinguishing three stress levels (primary, secondary, and unstressed). The influence of consonants on fo (Lehiste and Peterson, 1961) was approximated by causing the fo to start higher at the onset of a stressed syllable if it began with a voiceless consonant.
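The ingredients named in this paragraph (a declining baseline, three stress levels, and a raised onset after a voiceless consonant) can be combined in a few lines. The following is only an illustrative sketch, not Mattingly's actual rule set: the function name, the linear baseline, and every numeric value are assumptions chosen for readability.

```python
# Hypothetical stress-related rises in Hz; the values are placeholders,
# not figures taken from Mattingly (1968).
STRESS_RISE_HZ = {"primary": 20.0, "secondary": 10.0, "unstressed": 0.0}


def syllable_onset_f0(position, stress, voiceless_onset,
                      start_hz=130.0, end_hz=100.0, voiceless_boost_hz=8.0):
    """Return an onset fo value (Hz) for one syllable.

    position:        0.0 at the start of the sentence, 1.0 at the end
    stress:          "primary", "secondary", or "unstressed"
    voiceless_onset: True if the syllable begins with a voiceless consonant
    """
    # Declination, assumed linear here: the baseline falls from start_hz to end_hz.
    baseline = start_hz + (end_hz - start_hz) * position
    f0 = baseline + STRESS_RISE_HZ[stress]
    # The text applies the consonant effect to stressed syllables only.
    if voiceless_onset and stress != "unstressed":
        f0 += voiceless_boost_hz
    return f0
```

Reducing declination for American English would correspond, in this sketch, to choosing start_hz and end_hz closer together.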
A similar view of intonation was described in quite different terminology by 't Hart and Cohen (1973). In the spirit of Bolinger (1951), they defined the intonational "hat pattern" (see bottom portion of Fig. 25) as the tendency for intonation to rise on the first stressed syllable of a phrase and remain high until the final stressed syllable, where there is either a dramatic fall or a fall-rise depending on whether more material is to be spoken. The idea of intonational phrases is similar to the idea of the breath group advocated earlier by Lieberman (1967). Translation of these ideas to rules for English was performed by Maeda (1974), who also postulated stress-related local rises above the phrasal hat top whose magnitudes depended on phrasal position -- the size of pitch gestures tending to be reduced over the course of a phrase.
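As a rough illustration of the hat pattern with shrinking local rises, the sketch below turns the frame positions of the stressed syllables in one phrase into a list of pitch events. It is not Maeda's rule set: the function name, the event labels, the Hz values, and the geometric shrink factor are all assumptions made for the example.

```python
def hat_pattern_events(stress_frames, more_to_come=False,
                       hat_hz=30.0, first_rise_hz=20.0, shrink=0.7):
    """Return (frame, kind, size_hz) events for one intonational phrase.

    stress_frames: sorted frame indices of the stressed syllables
    more_to_come:  True if further material follows (fall-rise, not plain fall)
    shrink:        assumed factor by which each successive local rise is reduced
    """
    if not stress_frames:
        return []
    first, last = stress_frames[0], stress_frames[-1]
    # The hat rises on the first stressed syllable of the phrase.
    events = [(first, "hat_rise", hat_hz)]
    # Local rises sit above the hat top and get smaller through the phrase.
    for i, frame in enumerate(stress_frames):
        events.append((frame, "local_rise", first_rise_hz * shrink ** i))
    # The hat comes down on the final stressed syllable.
    kind = "fall_rise" if more_to_come else "fall"
    events.append((last, kind, -hat_hz))
    return events
```

For a phrase with stressed syllables at frames 10, 40, and 80, the events are a hat rise at frame 10, three local rises of decreasing size, and a fall at frame 80.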
The Maeda rules form the basis for the fo gestures produced by Klattalk. The detailed implementation is based on an idea of Öhman (1967). He proposed that intonation contours can be modeled in terms of impulses and step commands fed to a linear smoothing filter. This type of model has been applied to Japanese intonational synthesis by Fujisaki and Nagashima (1969), who were able to match natural intonation contours with remarkable fidelity. An example of the step and impulsive commands for a sentence generated by Klattalk rules is shown in Fig. 26.
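The impulse-and-step idea can be sketched in a few lines: commands are summed frame by frame and passed through a smoothing filter, so that abrupt commands become gradual fo movements. The code below is a minimal sketch, not the Klattalk implementation: the one-pole low-pass filter stands in for the unspecified "linear smoothing filter", and the function name, frame-based timing, and parameter values are assumptions.

```python
def smooth_commands(steps, impulses, n_frames, base_hz=100.0, alpha=0.8):
    """Build an fo contour (Hz per frame) from step and impulse commands.

    steps:    list of (start_frame, end_frame, height_hz); held on [start, end)
    impulses: list of (frame, height_hz); applied for a single frame
    alpha:    one-pole filter coefficient in [0, 1); larger is more sluggish
    """
    # Sum all commands into one raw command signal.
    command = [0.0] * n_frames
    for start, end, height in steps:
        for t in range(max(start, 0), min(end, n_frames)):
            command[t] += height
    for frame, height in impulses:
        if 0 <= frame < n_frames:
            command[frame] += height

    # One-pole low-pass filter: the state moves a fraction (1 - alpha)
    # of the way toward the command value on every frame.
    contour = []
    state = 0.0
    for value in command:
        state = alpha * state + (1.0 - alpha) * value
        contour.append(base_hz + state)
    return contour


# One phrase-level step (the hat) plus one stress-related impulse.
contour = smooth_commands(steps=[(10, 60, 30.0)], impulses=[(30, 40.0)],
                          n_frames=100)
```

With this filter a single-frame impulse is heavily attenuated (its peak is (1 - alpha) times the commanded height), so impulse heights must be set well above the desired local rise.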
The timing of the fundamental frequency rises and falls with respect to the locations of stressed vowels can have a fairly large perceptual effect. For example, gradual rises extending over the full vowel duration are heard as similar to continuation rises -- indicative of material prior to the most prominent or nuclear syllable of the utterance.
The most detailed current model of fo generation for American English (Pierrehumbert, 1981; Anderson et al., 1984) takes a somewhat different approach to the problem, and posits two fo target tones at an abstract level -- H (high) and L (low). Each stressed syllable of a sentence is assigned a sequence of one or two such tones according to syntax, discourse importance, and rhythmic position. In addition, there are two extra tones at the end of a phrase, one occurring between the last accent and the end, and the other occurring right at the end. These permit various forms of terminal falls and rises to be constructed. The assignment of fo targets and smooth transitions between targets is a complex function of a reference fo declination line (Öhman, 1967; Peck, 1969) and a time-varying pitch range (Cohen