|KLATT 1987, p. 752|
at the moment of release of the tongue tip from the roof of the mouth. Sonorant target values for F1, F2, and F3 depend somewhat on the following vowel, and a sonorant, particularly a postvocalic sonorant, can modify the formant values of the vowel a great deal (Lehiste, 1962).
The consonant /h/ is sometimes grouped with the fricatives because it is noise-excited, but /h/ functions more like a voiceless sonorant consonant. The sound source for /h/ is aspiration generated near the larynx, the vocal tract assumes the shape of the following vowel, and all formants are weakly excited by the noise.
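The description of /h/ as noise exciting the formants of the following vowel can be sketched numerically. The block below is a minimal illustration, not a synthesizer described in this section: it passes white noise through a cascade of two-pole digital resonators. The resonator difference equation is the standard two-pole form; the 10-kHz sampling rate, the formant frequencies and bandwidths, and all function names are assumptions chosen for illustration.

```python
import math
import random

def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients of a two-pole digital resonator with unity gain at 0 Hz."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(signal, freq_hz, bw_hz, fs_hz):
    """Filter a signal with y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def aspirated_vowel_shape(formants, n_samples, fs_hz=10000, seed=0):
    """Excite a cascade of formant resonators with Gaussian noise (an /h/-like source)."""
    rng = random.Random(seed)
    out = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    for freq_hz, bw_hz in formants:
        out = resonate(out, freq_hz, bw_hz, fs_hz)
    return out

# Illustrative (assumed) formant frequencies and bandwidths in Hz for an open vowel.
ASSUMED_OPEN_VOWEL = [(730.0, 60.0), (1090.0, 90.0), (2440.0, 150.0)]
h_like = aspirated_vowel_shape(ASSUMED_OPEN_VOWEL, n_samples=1000)
```

Swapping in the formant values of a different vowel changes the spectral shape of the noise while the source stays the same, which is the sense in which /h/ behaves like a voiceless version of the following vowel.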
c. Fricatives. Fricative consonants involve the generation of turbulence noise at a constriction in the vocal tract (Heinz and Stevens, 1961). The noise primarily excites the formants associated with the cavities in front of the constriction (Fant, 1960; Stevens, 1972). Acoustic properties that distinguish the English fricatives from one another include the general spectral shape of the frication noise and the motions of the formants into and out of adjacent sounds (rows 3 and 4 of Fig. 18). Each fricative noise has a relatively fixed characteristic spectral shape, although differences are observed across speakers and across phonetic environments; e.g., anticipatory lip rounding for a rounded vowel may lower the frequencies of the most prominent spectral peaks slightly. Formant motion cues, which are particularly important for distinguishing between /f/ and /θ/ (Harris, 1958), depend to a much greater extent on the vocal tract shape of adjacent vowels. The voiced fricatives of English are shorter than their voiceless counterparts and usually contain simultaneous voicing at low frequencies.
d. Plosives. The voiced plosives of English, /b,d,g/, consist of a closure interval, a brief burst of turbulence noise at release, and formant transitions into and out of adjacent segments (Fischer-Jorgensen, 1954; Halle et al., 1957). The spectrum of the noise burst, its duration, and the motions of the formants into a following vowel have all been shown to be important perceptual cues under some circumstances (Cooper et al., 1952; Delattre et al., 1955). While nominally voiced, /b,d,g/ include evidence of voicing during closure, i.e., the periodic low-frequency energy known as a voicebar, only in certain phonetic environments. Devoiced allophones, as well as several other allophones that occur in specific phonetic/stress environments, are discussed in Sec. I D 4 on phonological recoding.
The voiceless plosives of English, /p,t,k/, are similar to /b,d,g/ except that there is an interval of /h/-like aspiration noise following the burst because vocal fold adduction necessary for voicing onset is delayed (Liberman et al., 1958; Lisker and Abramson, 1967). Most of the formant transitions take place while aspiration is the sound source. The burst is slightly longer and more intense, and formant transitions are somewhat less distinct in voiceless plosives, making the burst a more potent cue to place of articulation.
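The difference in source timing between the two plosive series can be laid out as a schedule of sound sources. The sketch below is only a bookkeeping illustration: every duration is a hypothetical placeholder rather than a measured value, the closure is treated as a single unlabeled interval (no voicebar), and the function names are invented here.

```python
def plosive_source_schedule(voiceless, closure_ms=60.0, burst_ms=5.0,
                            aspiration_ms=50.0, vowel_ms=150.0):
    """Return (source, start_ms, end_ms) segments for a plosive + vowel syllable.

    All default durations are hypothetical placeholders.
    """
    segments = [("closure", closure_ms), ("burst", burst_ms)]
    if voiceless:
        # Voicing onset is delayed, so /h/-like aspiration fills the gap.
        segments.append(("aspiration", aspiration_ms))
    segments.append(("voicing", vowel_ms))

    timeline = []
    t = 0.0
    for source, duration in segments:
        timeline.append((source, t, t + duration))
        t += duration
    return timeline

def voice_onset_time(timeline):
    """Interval from burst onset to voicing onset, in ms."""
    burst_start = next(start for name, start, _ in timeline if name == "burst")
    voicing_start = next(start for name, start, _ in timeline if name == "voicing")
    return voicing_start - burst_start

voiced = plosive_source_schedule(voiceless=False)
unvoiced = plosive_source_schedule(voiceless=True)
```

In a rule program, the formant transitions would be computed over the same time axis, so that for the voiceless case most of the transition falls inside the aspiration segment.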
The English affricates /tʃ/ and /dʒ/ are usually analyzed phonetically as consisting of a plosive followed by a fricative, i.e., /t/ + /ʃ/ and /d/ + /ʒ/. Their observed acoustic properties (Fig. 18) generally agree with such an assumption, although the duration of frication noise is less than in a full fricative (Gerstman, 1957).
e. Nasals. The nasal consonants consist of a murmur during the interval when the oral cavity is closed, and rapid transitions into and out of adjacent segments, row 5 of Fig. 18. The murmur has a complex spectrum with a strong first formant prominence at about 300 Hz. There are both poles and zeros in the transfer function, with frequency locations dependent on the length of the side-branch resonator formed by the occluded oral cavity (Fant, 1960; Fujimura, 1962). Formant transitions into adjacent segments are similar to those for the corresponding voiced plosive (Liberman et al., 1954), although there is usually some degree of nasalization of adjacent segments to complicate the picture (Fujimura, 1960). The primary acoustic cue to nasalization of a vowel is the splitting of F1 into a pole-zero-pole complex (Stevens et al., 1987). It is difficult to distinguish one nasal consonant from another if presented only with the murmur spectrum (Malecot, 1956); formant transitions appear to be somewhat more potent cues to place of articulation, although it is perhaps the relation of the onset spectrum at release to the murmur that is perceptually most important to place-of-articulation judgments (Repp, 1986).
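The dependence of the zero locations on side-branch length can be made concrete with a quarter-wavelength approximation. The sketch below assumes a uniform tube closed at its far end, a speed of sound of 35 000 cm/s, and purely hypothetical branch lengths and bandwidths; it is an idealization, not the analysis of Fant or Fujimura. It also evaluates the magnitude of a simple pole-zero transfer function to show the spectral dip that a zero introduces above the low first-formant prominence.

```python
import math

SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed round value

def side_branch_zero_hz(branch_length_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """Lowest antiresonance of a uniform side branch closed at its far end."""
    return c / (4.0 * branch_length_cm)

def s_plane_root(freq_hz, bw_hz):
    """Complex s-plane location of a pole or zero with the given frequency and bandwidth."""
    return complex(-math.pi * bw_hz, 2.0 * math.pi * freq_hz)

def pair_gain(f_hz, root):
    """|(s - r)(s - r*)| / |r|^2 at s = j*2*pi*f; equals 1 at f = 0."""
    s = complex(0.0, 2.0 * math.pi * f_hz)
    return abs((s - root) * (s - root.conjugate())) / abs(root) ** 2

def murmur_magnitude(f_hz, poles, zeros):
    """Magnitude of a pole-zero transfer function normalized to unity at 0 Hz."""
    mag = 1.0
    for freq_hz, bw_hz in zeros:
        mag *= pair_gain(f_hz, s_plane_root(freq_hz, bw_hz))
    for freq_hz, bw_hz in poles:
        mag /= pair_gain(f_hz, s_plane_root(freq_hz, bw_hz))
    return mag

# Hypothetical oral side-branch lengths in cm: a longer branch gives a lower zero.
HYPOTHETICAL_BRANCH_CM = {"m": 8.0, "n": 5.0, "ng": 2.5}
zero_hz = {nasal: side_branch_zero_hz(length)
           for nasal, length in HYPOTHETICAL_BRANCH_CM.items()}

# One pole near the 300-Hz first-formant prominence, one zero from the /m/-like branch.
poles = [(300.0, 60.0)]      # bandwidth is an assumption
zeros = [(zero_hz["m"], 100.0)]  # bandwidth is an assumption
dip = murmur_magnitude(zero_hz["m"], poles, zeros)
```

Under these assumptions the zero moves upward as the place of closure moves back and the occluded oral cavity shortens, which is one reason the murmur spectra differ across nasals even though listeners find the murmur alone a weak cue.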
While this brief sketch of the acoustic properties of consonant-vowel syllables has identified some of the relevant early literature, it is important to realize that the studies referenced are not always sufficiently detailed for synthesis purposes, and isolated CV syllables are far from an exhaustive inventory of phenomena that must be treated in a rule program (see later sections on allophonics and prosody). Also, prevocalic and postvocalic consonant clusters introduce additional complications. A serious worker entering this field will probably have to develop an extensive personal data base of speech materials for analysis, rule development, and perceptual validation of chosen synthesis strategies.
C. Segmental synthesis-by-rule programs
The speech copying techniques described earlier succeed, in part, because they reproduce essentially all of the potential cues present in the waveform or spectrum, even though we may not know which cues are most important to the human listener. A synthesis-by-rule program, on the other hand, constitutes a set of rules for generating what are often highly stylized and simplified approximations to natural speech. As such, the rules are an embodiment of a theory as to exactly which cues are important for each phonetic contrast.
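To make the idea of "stylized and simplified approximations" concrete, the following sketch shows perhaps the simplest conceivable rule: each phoneme is assigned a fixed set of formant targets, and tracks are drawn as straight lines between targets placed at segment midpoints. This is a toy under stated assumptions, not any of the rule programs reviewed here; the target table holds illustrative values only, and the segment duration, frame interval, and function names are all hypothetical.

```python
# Illustrative (assumed) F1, F2, F3 targets in Hz for three vowels.
TARGETS = {
    "i": (270.0, 2290.0, 3010.0),
    "a": (730.0, 1090.0, 2440.0),
    "u": (300.0, 870.0, 2240.0),
}

def interpolate(t_ms, mids, targets):
    """Linearly interpolate formant targets at time t_ms; hold at the ends."""
    if t_ms <= mids[0]:
        return targets[0]
    if t_ms >= mids[-1]:
        return targets[-1]
    for k in range(len(mids) - 1):
        if mids[k] <= t_ms <= mids[k + 1]:
            w = (t_ms - mids[k]) / (mids[k + 1] - mids[k])
            return tuple((1.0 - w) * a + w * b
                         for a, b in zip(targets[k], targets[k + 1]))

def formant_track(phonemes, segment_ms=200, frame_ms=5):
    """Return one (F1, F2, F3) tuple per frame for a phoneme string."""
    mids = [(k + 0.5) * segment_ms for k in range(len(phonemes))]
    targets = [TARGETS[p] for p in phonemes]
    total_ms = segment_ms * len(phonemes)
    return [interpolate(t, mids, targets) for t in range(0, total_ms, frame_ms)]

track = formant_track(["a", "i"])
```

Even this toy embodies a theory in the sense described above: it claims that target frequencies and a transition shape suffice, and it ignores every cue it does not generate. Real rule programs differ chiefly in how much more of the acoustic detail they choose to specify.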
Early rule programs have been described and compared in a good review paper prepared by Mattingly (1974), so only the highlights will be mentioned here. Techniques have been divided into three broad categories: (1) heuristic acoustic-domain rules to control a formant synthesizer, (2) articulatory rules to control a model of the larynx and vocal tract, and (3) strategies for concatenating pieces of encoded natural speech.
1. Formant-based rule programs
The first synthesis-by-rule program capable of synthesizing speech from a phonemic representation was written by Kelly and Gerstman (1961, 1964). They used a cascaded