
 KLATT 1987, p. 759 

seemed quite intelligible (example 18 of the Appendix), but the project was terminated for business rather than technical reasons before they were able to add rules for automatically generating segment durations and an f0 contour from an abstract phonemic representation.

The advent of linear prediction speech analysis/resynthesis techniques opened up the possibility of automated procedures for creating a diphone inventory. Olive and Spickenagle (1976) attempted to extract the essential features of each diphone by characterizing it in terms of an initial linear prediction pseudo-area function and a linear transition to a final pseudo-area function. Diphones obtained from stressed syllables could be used to synthesize new stressed syllables, but the extensive time expansion and contraction of diphones required to satisfy timing rules for stressed and unstressed syllables of English sentences has been a problem. The large gain in naturalness that one might expect from using pieces derived from natural speech has not been realized, owing to necessary compromises such as smoothing at diphone boundaries, changing the durations of the diphones, and imposing a fundamental frequency contour different from the one originally recorded (example 22 of the Appendix). At this time, the naturalness of text-to-speech systems based on linear prediction diphones is, in my opinion, neither significantly better nor worse than that of formant synthesis by rule, although the two types of systems seem to have different sets of perceived deficiencies in naturalness. Diphones must all be recorded by a speaker who can hold voice quality constant, so that there are no sudden changes in the source spectrum in the middle of syllables. But this also means that there is no simple way to vary voice quality over a sentence as a function of syllable stress and position within the sentence, leading to a somewhat stereotyped voice quality. The buzziness inherent in LPC further degrades perceived voice quality. A flexible formant synthesizer, on the other hand, permits manipulation of the voicing source characteristics over a sentence, but we do not yet know the rules for doing this in an optimal way.
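The core of the Olive and Spickenagle (1976) characterization can be sketched in a few lines: a diphone is reduced to an initial pseudo-area function, a final one, and a linear transition between them, with each interpolated area function convertible to the reflection coefficients of the equivalent lossless-tube model. The sketch below is a minimal illustration of that idea, not their implementation; the function names, frame counts, and three-section example areas are hypothetical.

```python
import numpy as np

def interpolate_diphone(area_start, area_end, n_frames):
    """Linearly interpolate from an initial to a final pseudo-area
    function, producing one area vector per synthesis frame
    (illustrative sketch of the Olive & Spickenagle characterization)."""
    area_start = np.asarray(area_start, dtype=float)
    area_end = np.asarray(area_end, dtype=float)
    t = np.linspace(0.0, 1.0, n_frames)[:, None]   # frame positions, 0..1
    # Shape: (n_frames, n_sections); row 0 is the start, last row the end.
    return (1.0 - t) * area_start + t * area_end

def areas_to_reflection_coeffs(areas):
    """Reflection coefficients of the lossless-tube model with these
    cross-sectional areas: k_i = (A_{i+1} - A_i) / (A_{i+1} + A_i)."""
    a = np.asarray(areas, dtype=float)
    return (a[1:] - a[:-1]) / (a[1:] + a[:-1])

# Hypothetical three-section areas, expanded over five frames:
frames = interpolate_diphone([1.0, 2.0, 4.0], [2.0, 2.0, 1.0], n_frames=5)
coeffs = [areas_to_reflection_coeffs(f) for f in frames]
```

Time expansion or contraction, the problem noted above, would amount to changing `n_frames` for a given diphone, which stretches or compresses the transition without regard to which portions of the real articulation actually vary in duration.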

The intelligibility of carefully chosen diphones can be quite high, especially with modern methods, such as the use of multipulse linear prediction (Atal and Remde, 1982) to more accurately characterize noise bursts and other onsets. A third generation of the Olive diphone concatenation scheme is used in an experimental AT&T Bell Laboratories text-to-speech system (Olive and Liberman, 1985) (example 34 of the Appendix). An earlier version of this Bell Laboratories system has been demonstrated for several years at the Epcot Center of Walt Disney World. Conversant Systems, a wholly owned subsidiary of AT&T, has indicated plans to offer for sale a version of this system, although no date has been set for its availability.

A closely related alternative to the diphone is the demisyllable (Fujimura and Lovins, 1978), i.e., half of a syllable. The inventory of half-syllables in English is about 1000 if one is clever about the treatment of certain postvocalic clusters (treating morphemic plural and past consonant sequences such as "-s" and "-t" as separable units, as suggested by Fujimura and Lovins). The advantage of the demisyllable is that highly coarticulated syllable-internal consonant clusters are treated as units, while the disadvantage is that coarticulation across syllables is not treated very well. A synthesis-by-rule program based on demisyllables has been demonstrated by Browman (1980) (example 23 of the Appendix). Perhaps the best choice among concatenation models is a hybrid diphone approach that uses consonant clusters as units when necessary to model the acoustic manifestations of consonant sequences in a satisfactory way (Olive and Liberman, 1979).

In summary, efforts to develop methods for synthesizing phonetic segments to make up arbitrary sentences have proceeded along three lines: creation of (1) heuristic rules for controlling formant synthesizers, (2) "natural" rules for controlling articulatory models, and (3) methods for concatenating pieces of LPC-encoded real speech. The inherent attraction of articulatory solutions must be tempered by practical considerations of computational cost and lack of data upon which to develop rules. The choice between rule systems for formant synthesizers and concatenation strategies may ultimately depend on limits to the flexibility and naturalness of concatenation schemes involving encoded natural speech, but the best current LPC-based systems are quite competitive with the best formant-based rule programs.

D. Prosody and sentence-level phonetic recoding

A sentence cannot be synthesized by simply stringing together a sequence of phonemes or words. It is very important to get the timing, intonation, and allophonic detail correct if a sentence is to sound intelligible and moderately natural (Fig. 24). Prosodic details also help the listener segment the acoustic stream into words and phrases (Nakatani and Schafer, 1978; Svensson, 1974; Streeter, 1978). The following three sections take up these topics in detail.

A pure tone can be characterized in physical terms by its intensity, duration, and fundamental frequency. These induce the sensations of loudness, length, and pitch, respectively. In speech, it is the change over time in these prosodic parameters of intensity, duration, and f0 that carry linguistically
 

Smithsonian Speech Synthesis History Project