| KLATT 1987, p. 752 |
at the moment of release of the tongue tip from the roof of the mouth. Sonorant target values for F1, F2, and F3 depend somewhat on the following vowel, and a sonorant, particularly a postvocalic sonorant, can modify the formant values of the vowel a great deal (Lehiste, 1962). The consonant /h/ is sometimes grouped with the fricatives because it is noise-excited, but /h/ functions more like a voiceless sonorant consonant. The sound source for /h/ is aspiration generated near the larynx, the vocal tract assumes the shape of the following vowel, and all formants are weakly excited by the noise.
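The description of /h/ above (a noise source exciting the formants of the following vowel) can be illustrated with a minimal sketch. It passes white noise through a cascade of two-pole digital resonators of the kind commonly used in formant synthesizers. The formant frequencies and bandwidths, the 10-kHz sample rate, and the function names are illustrative assumptions, not values taken from the text.

```python
import math
import random


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    return a, b, c


def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Filter a list of samples through one two-pole resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def aspirate(formants, duration_s=0.05, sample_rate=10000, seed=0):
    """Noise-excited, /h/-like signal shaped by (frequency, bandwidth) pairs."""
    rng = random.Random(seed)
    n = round(duration_s * sample_rate)
    signal = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for freq_hz, bw_hz in formants:  # cascade: each resonator feeds the next
        signal = resonate(signal, freq_hz, bw_hz, sample_rate)
    return signal


# Assumed formant values (Hz) for a neutral vowel; placeholders only.
samples = aspirate([(500, 60), (1500, 90), (2500, 150)])
```

Changing the `formants` list to the values of a different vowel changes the spectral shape of the noise, which is the sense in which /h/ takes on the shape of the following vowel.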
c. Fricatives. Fricative consonants involve the generation of
turbulence noise at a constriction in the vocal tract (Heinz and
Stevens, 1961). The noise primarily excites the formants associated
with the cavities in front of the constriction (Fant, 1960; Stevens,
1972). Acoustic properties that distinguish the English fricatives
from one another include the general spectral shape of the frication
noise and the motions of the formants into and out of adjacent
sounds (rows 3 and 4 of Fig. 18). Each fricative noise has a relatively
fixed characteristic spectral shape, although there are differences
observed across speakers and across phonetic environments -- e.g.,
anticipatory lip rounding for a rounded vowel may lower the frequencies
of the most prominent spectral peaks slightly. Formant motion cues,
which are particularly important for distinguishing between
d. Plosives. The voiced plosives of English, /b,d,g/, consist of a closure interval, a brief burst of turbulence noise at release, and formant transitions into and out of adjacent segments (Fischer-Jorgensen, 1954; Halle et al., 1957). The spectrum of the noise burst, its duration, and the motions of the formants into a following vowel have all been shown to be important perceptual cues under some circumstances (Cooper et al., 1952; Delattre et al., 1955). While nominally voiced, /b,d,g/ include evidence of voicing during closure, i.e., the periodic low-frequency energy known as a voicebar, only in certain phonetic environments. Devoiced allophones, as well as several other allophones that occur in specific phonetic/stress environments, are discussed in Sec. I D 4 on phonological recoding. The voiceless plosives of English, /p,t,k/, are similar to /b,d,g/ except that there is an interval of /h/-like aspiration noise following the burst because vocal fold adduction necessary for voicing onset is delayed (Liberman et al., 1958; Lisker and Abramson, 1967). Most of the formant transitions take place while aspiration is the sound source. The burst is slightly longer and more intense, and formant transitions are somewhat less distinct in voiceless plosives, making the burst a more potent cue to place of articulation.
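The sequence of acoustic events described for plosives can be sketched as a simple timeline of sound sources. This is only an illustration of the ordering (closure, burst, optional aspiration, voicing onset). All durations below are placeholder assumptions rather than measured values, and the names `Segment`, `plosive_events`, and `voice_onset_time_ms` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str          # name of the acoustic event
    source: str         # "silence", "voicing", "frication", or "aspiration"
    duration_ms: float


def plosive_events(voiced, closure_ms=60.0, burst_ms=5.0,
                   aspiration_ms=40.0, voicebar=False):
    """Ordered acoustic events for a prevocalic plosive (durations assumed)."""
    # A voicebar appears during closure only in certain environments.
    closure_source = "voicing" if (voiced and voicebar) else "silence"
    events = [
        Segment("closure", closure_source, closure_ms),
        Segment("burst", "frication", burst_ms),
    ]
    if not voiced:
        # Voiceless plosives: /h/-like aspiration delays voicing onset.
        events.append(Segment("aspiration", "aspiration", aspiration_ms))
    events.append(Segment("vowel onset", "voicing", 0.0))
    return events


def voice_onset_time_ms(events):
    """Time from burst onset to the first voiced segment after release."""
    start = [seg.label for seg in events].index("burst")
    vot = 0.0
    for seg in events[start:]:
        if seg.source == "voicing":
            break
        vot += seg.duration_ms
    return vot


voiced_vot = voice_onset_time_ms(plosive_events(voiced=True))
voiceless_vot = voice_onset_time_ms(plosive_events(voiced=False))
```

With these placeholder durations the voiceless timeline has the longer delay before voicing, which reflects the aspiration interval described above. A fuller treatment would also lengthen and intensify the voiceless burst.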
The English affricates
e. Nasals. The nasal consonants
While this brief sketch of the acoustic properties of consonant-vowel syllables has identified some of the relevant early literature, it is important to realize that the studies referenced are not always sufficiently detailed for synthesis purposes, and isolated CV syllables are far from an exhaustive inventory of phenomena that must be treated in a rule program (see later sections on allophonics and prosody). Also, prevocalic and postvocalic consonant clusters introduce additional complications. A serious worker entering this field will probably have to develop an extensive personal data base of speech materials for analysis, rule development, and perceptual validation of chosen synthesis strategies.
C. Segmental synthesis-by-rule programs
The speech copying techniques described earlier succeed, in part, because they reproduce essentially all of the potential cues present in the waveform or spectrum, even though we may not know which cues are most important to the human listener. A synthesis-by-rule program, on the other hand, constitutes a set of rules for generating what are often highly stylized and simplified approximations to natural speech. As such, the rules are an embodiment of a theory as to exactly which cues are important for each phonetic contrast. Early rule programs have been described and compared in a good review paper prepared by Mattingly (1974) [Ed: online, this site], so only the highlights will be mentioned here. Techniques have been divided into three broad categories: (1) heuristic acoustic-domain rules to control a formant synthesizer, (2) articulatory rules to control a model of the larynx and vocal tract, and (3) strategies for concatenating pieces of encoded natural speech.
1. Formant-based rule programs
The first synthesis-by-rule program capable of synthesizing speech
from a phonemic representation was written by Kelly and Gerstman (1961,
1964). They used a cascaded
| Smithsonian Speech Synthesis History Project | |
| Archives Center | |
| NATIONAL MUSEUM OF AMERICAN HISTORY | |
| Smithsonian Institution - Washington, D.C. 20560 | |