
 KLATT 1987, p. 747 
 

that men and women adopt certain speaking strategies and dialectal differences to signal their gender (Kahn, 1975; Labov, 1986).

Based on a detailed spectral analysis of a single female speaker having a pleasant voice quality (Klatt, 1986b), I have begun efforts to synthesize a copy of some of her utterances using the flexibility of the new Klattalk voicing source. The analysis revealed the presence of considerable random breathiness noise at frequencies above 2 kHz over portions of many utterances (a possibility noted earlier by Fujimura, 1968), and considerable variation in both the general tilt of the harmonic spectrum and the strength of the fundamental component. When these factors are modeled in the synthesis, by varying the open quotient, spectral tilt, and breathiness noise amplitude parameters of the Klattalk voicing source, Fig. 12, very good approximations to this voice are achieved for isolated vowels. Success was achieved even though I used a cascade synthesizer rather than the parallel configuration advocated by Holmes, and therefore did not have direct control over each formant amplitude. Also, for at least this one voice, the source spectral zeros seemed to be well matched in location and depth with respect to observed natural spectral dips, even though only the open quotient parameter was available as a means of adjusting the frequency positions of the zeros.
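As a rough illustration of how the three source parameters named above interact, the following sketch builds one pitch period of a glottal waveform. It is not the Klattalk voicing source: the pulse shape (a polynomial x²(1 − x) over the open phase), the one-pole low-pass standing in for spectral tilt, and the white Gaussian noise added during the open phase are all simplifying assumptions, and the function and parameter names are hypothetical.

```python
import random

def glottal_period(f0_hz, fs_hz, open_quotient, tilt_coeff, noise_amp, seed=0):
    """Return one pitch period of a simplified glottal flow (illustrative only).

    open_quotient: fraction of the period during which the glottis is open.
    tilt_coeff:    0 <= c < 1; larger values attenuate high frequencies more.
    noise_amp:     amplitude of noise added while the glottis is open.
    """
    rng = random.Random(seed)
    n_period = int(round(fs_hz / f0_hz))
    n_open = max(2, int(round(open_quotient * n_period)))

    raw = []
    for n in range(n_period):
        if n < n_open:
            x = n / n_open                    # position within the open phase, 0..1
            flow = x * x * (1.0 - x)          # assumed pulse shape, zero at both ends
            flow += noise_amp * rng.gauss(0.0, 1.0)  # assumed breathiness noise
        else:
            flow = 0.0                        # closed phase
        raw.append(flow)

    # Assumed tilt control: one-pole low-pass with unity gain at DC.
    out = []
    prev = 0.0
    for v in raw:
        prev = (1.0 - tilt_coeff) * v + tilt_coeff * prev
        out.append(prev)
    return out

# Example: 200 Hz voice at 10 kHz sampling, half-open period, mild tilt, light noise.
period = glottal_period(200.0, 10000.0, 0.5, 0.3, 0.01)
```

In this sketch, lengthening the open quotient widens the pulse, raising the tilt coefficient smooths it, and the noise amplitude sets how breathy the period sounds. The noise here is white, whereas the analysis above found breathiness noise mainly above 2 kHz.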

In order to see if the preliminary success with isolated vowels could be generalized to more complex speech materials, the next step taken was to analyze a set of reiterant sentences that were spoken by replacing all of the intended syllables by [ʔV] or [hV], where [V] was one of six English vowels. Utterances involving a glottal stop were considerably easier to model (example 10 of the Appendix). The vowel spectra generally conformed to the simplified acoustic theory implicit in the synthesizer. However, in the [hV] materials, many of the voiced intervals revealed additional formant peaks and other harmonic amplitude discrepancies, presumably related to acoustic coupling with tracheal resonances when the glottis is partially open. An example is shown in Fig. 13. My best synthesis efforts that did not contain these irregularities were judged to be less human and less like the speaker than in the case of the glottal stop syllables.

These results suggest that spectral details in the mid and low frequencies can be of considerable importance to speaker identity and to naturalness judgments, especially in a female voice, where harmonics are widely spaced and more easily resolved by the auditory system. At this point, it is hard to decide how best to augment the synthesizer in order to model the sudden appearance of additional formants and zeros in breathy vowels. For example, would one additional pole-zero pair be sufficient to approximate the primary perceptual effects of tracheal interactions? Also needed are data upon which to base rules for positioning additional resonance peaks and dips as a function of presumed glottal state and vocal tract shape (it is tough enough estimating formant frequencies in high-pitched voices -- to require the simultaneous detection of an unknown number of additional pole-zero pairs as well as specification of glottal source parameters may be asking too much). Nevertheless, a preliminary attempt to analyze and synthesize a full sentence using a synthesizer configuration augmented by an extra tracheal pole-zero pair (first part of example 10 of the Appendix) has met with some success.
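To make the idea of an extra pole-zero pair concrete, the sketch below cascades a two-pole digital resonator with a matching two-zero antiresonator. This is a generic second-order formulation, not the configuration actually used for the sentence described above; the function names are hypothetical, and the frequencies and bandwidths in the example call are placeholders, not measured tracheal values.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] with unity gain at DC."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonator(x, freq_hz, bw_hz, fs_hz):
    """Two-pole resonator: adds a spectral peak at freq_hz."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = a * xn + b * y1 + c * y2
        out.append(yn)
        y2, y1 = y1, yn
    return out

def antiresonator(x, freq_hz, bw_hz, fs_hz):
    """Two-zero antiresonator: the exact inverse of the resonator above."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    a2, b2, c2 = 1.0 / a, -b / a, -c / a
    x1 = x2 = 0.0
    out = []
    for xn in x:
        out.append(a2 * xn + b2 * x1 + c2 * x2)
        x2, x1 = x1, xn
    return out

def tracheal_pair(x, pole_hz, pole_bw, zero_hz, zero_bw, fs_hz):
    """Hypothetical extra pole-zero pair applied in cascade to signal x."""
    return antiresonator(resonator(x, pole_hz, pole_bw, fs_hz),
                         zero_hz, zero_bw, fs_hz)

# Placeholder values only: a peak at 1500 Hz paired with a dip at 1800 Hz.
impulse = [1.0] + [0.0] * 255
shaped = tracheal_pair(impulse, 1500.0, 200.0, 1800.0, 200.0, 10000.0)
```

When the pole and zero share the same frequency and bandwidth they cancel and the signal passes unchanged, so such a pair could be switched in gradually as the glottis opens by moving the two apart. Whether a single pair is perceptually sufficient is the open question raised above.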

An alternative solution to the problem of producing a natural female voice quality by a formant synthesizer might be to employ articulatory models of the trachea, vocal folds, and vocal tract, as well as their interactions, in a sophisticated articulatory synthesizer. Thus we now turn to efforts to produce speech by direct simulation of the mechanisms involved in speech generation.

4. Articulatory models

The transfer function of the vocal tract can be modeled by formant resonators, as above, or by a direct transmission line analog of the distribution of incremental pressures and volume velocities in a tube shaped like the vocal tract. In an articulatory model the tube corresponding to the vocal tract is usually divided into many small sections, and each section is approximated by an electrical transmission line analog (Dunn, 1950; Stevens et al., 1953). The equations are summarized in Flanagan (1972).
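The sectioned-tube idea can be sketched numerically. The code below is a minimal lossless version, assuming plane-wave propagation, rigid walls, and an ideal open end at the lips (zero radiation impedance); real transmission line analogs include losses and a radiation load. The constants, function names, and section sizes are illustrative assumptions, not values taken from the models cited above.

```python
import math

SPEED_OF_SOUND = 35000.0   # cm/s, assumed
AIR_DENSITY = 0.00114      # g/cm^3, assumed

def flow_denominator(areas_cm2, section_len_cm, freq_hz):
    """Return D of the chain matrix [[A, B], [C, D]] from glottis to lips.

    With zero pressure at the lips, U_lips / U_glottis = 1 / D, so the
    resonances of the tube are the frequencies where D crosses zero.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    A, B, C, D = 1 + 0j, 0j, 0j, 1 + 0j
    for area in areas_cm2:
        z = AIR_DENSITY * SPEED_OF_SOUND / area   # characteristic impedance
        a = math.cos(k * section_len_cm)
        b = 1j * z * math.sin(k * section_len_cm)
        c = 1j * math.sin(k * section_len_cm) / z
        # Right-multiply the running product by this section's matrix.
        A, B, C, D = A * a + B * c, A * b + B * a, C * a + D * c, C * b + D * a
    return D.real   # D stays purely real for a lossless tube

def formants(areas_cm2, section_len_cm, f_max=5000.0, step=5.0):
    """Scan for sign changes of D and refine each one by bisection."""
    found = []
    f_prev = step
    d_prev = flow_denominator(areas_cm2, section_len_cm, f_prev)
    f = f_prev + step
    while f <= f_max:
        d = flow_denominator(areas_cm2, section_len_cm, f)
        if d_prev != 0.0 and d_prev * d <= 0.0:
            lo, hi, d_lo = f_prev, f, d_prev
            for _ in range(40):
                mid = 0.5 * (lo + hi)
                d_mid = flow_denominator(areas_cm2, section_len_cm, mid)
                if d_lo * d_mid <= 0.0:
                    hi = mid
                else:
                    lo, d_lo = mid, d_mid
            found.append(0.5 * (lo + hi))
        f_prev, d_prev = f, d
        f += step
    return found

# Uniform tube: 35 sections of 0.5 cm (17.5 cm total), 5 cm^2 each.
uniform = formants([5.0] * 35, 0.5)
```

For the uniform tube the sketch should recover the quarter-wavelength resonances near 500, 1500, and 2500 Hz; changing individual entries of the area list moves the resonances, which is the sense in which the area function determines the formants.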

These first electronic models were static and required the hand adjustment of a variable inductor in each section. The possibility of dynamic control was added to the M.I.T. model of Stevens et al. (1953) by Rosen (1958). The electronic circuits, shown in Fig. 14, included a buzz source for voicing, and the ability to inject a noise source at the location of a constriction in the vocal tract. Hecker (1962) added a sidebranch to approximate the nasal tract. In 1961 at the fall meeting of the Acoustical Society of America, Kenneth Stevens and Arthur House demonstrated that such models
 

Smithsonian Speech Synthesis History Project
National Museum of American History | Archives Center