
 KLATT 1987, p. 763 
 

't Hart, 1967). The model can deal with a wide range of observed intonational patterns, but many of the patterns could only be predicted from text if one were a mind reader (Bolinger, 1972). A stripped-down version of the model is used in the Bell Laboratories text-to-speech system described earlier. Demonstrations of the system (example 34 of the Appendix) use input text where adjective-noun and compound stress patterns are hand corrected if necessary, because getting this aspect of prosody correct is both difficult and perceptually quite important.

Working with rule systems that generate f0 and duration patterns for sentences in a text-to-speech context can be frustrating, because one depends on sentence analysis routines to determine aspects of syntactic structure or semantic importance, and these routines are often wrong. When a text-to-speech system makes a phonemic pronunciation error, the user may be able to override the text-to-phoneme process by re-specifying the word phonemically. Fortunately, in some systems, the same type of user-correction capability exists for prosodic errors. For example, DECtalk permits syntactic symbols to be placed in the orthographic or phonemic transcription. If this does not lead to a better prosodic reading, the device will accept durations, specified in ms, for any input phonetic segment (Conroy et al., 1986). A hand-drawn fundamental frequency contour can also be specified by straight-line interpolation between f0 targets specified at the end of each phonetic segment. Fairly natural prosody can be achieved by the painstaking copying of a recorded utterance using these facilities.
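The straight-line interpolation between f0 targets can be illustrated with a minimal Python sketch. It is not DECtalk's command syntax or implementation. The function names, the list-based input format, and the `start_hz` value for the contour at time zero are assumptions introduced here for illustration; the source specifies only that durations are given in ms and that targets fall at the end of each segment.

```python
from bisect import bisect_left


def segment_end_times(durations_ms):
    """Return the cumulative end time (ms) of each segment."""
    ends, t = [], 0.0
    for d in durations_ms:
        t += d
        ends.append(t)
    return ends


def f0_at(t_ms, durations_ms, targets_hz, start_hz):
    """Interpolate f0 (Hz) at time t_ms along straight lines between targets.

    Each target sits at the end of its segment. start_hz is a hypothetical
    starting value at t = 0 (an assumption, not from the source).
    """
    if len(durations_ms) != len(targets_hz):
        raise ValueError("need exactly one f0 target per segment")
    times = [0.0] + segment_end_times(durations_ms)
    values = [start_hz] + list(targets_hz)
    # Clamp outside the utterance.
    if t_ms <= times[0]:
        return values[0]
    if t_ms >= times[-1]:
        return values[-1]
    # First breakpoint at or after t_ms; the previous one is strictly before.
    i = bisect_left(times, t_ms)
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t_ms - t0) / (t1 - t0)
```

For two segments lasting 100 ms and 50 ms with end targets of 120 Hz and 100 Hz and an assumed start of 110 Hz, `f0_at(50, [100, 50], [120, 100], 110)` returns 115.0, halfway up the first rising line.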

4. Allophone selection

We have assumed that words are lexically represented by phonemes and stress symbols. Allophone selection is then an important aspect of the sentence generation process. For example, the word "city" might appear in a pronouncing dictionary as /s'Iti/, i.e., with a medial /t/ phoneme, but the word is almost always pronounced with a flap variant of the /t/ (see Fig. 27). It might appear possible to obviate the need for a flapping rule by simply representing "city" with a flap in the first place. However, a flap rule is still required in a text-to-speech system in order to turn the fully released [t] of "bait" into a flap in a phrase such as "bait a hook." Slightly oversimplifying, a /t/ is flapped in American English between two sonorants if the second is unstressed. At least for those cases where a phoneme can take on different allophones depending on the context of the word, a set of allophone selection rules is unavoidable. Cross-word-boundary phonological recoding is significant in English, as we will see.
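The simplified flapping rule stated above can be sketched in Python as follows. This is an illustration of the rule as worded in the text, not the rule set of any actual synthesizer. The `Segment` type, the ARPAbet-like symbols (including "DX" for the flap), and the small sonorant inventory are assumptions made for the example. Words are concatenated into one segment sequence before the rule runs, so it applies across word boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    symbol: str             # ARPAbet-like label (assumed notation)
    stressed: bool = False  # meaningful for vowels only


# Hypothetical, deliberately small inventories for illustration.
VOWELS = {"IH", "IY", "EY", "AX", "UH", "AE"}
SONORANT_CONSONANTS = {"M", "N", "NG", "L", "R", "W", "Y"}
SONORANTS = VOWELS | SONORANT_CONSONANTS


def apply_flapping(segments):
    """Replace /t/ with a flap between two sonorants if the second is unstressed.

    Follows the simplified statement of the rule; real rule sets add
    further conditions.
    """
    out = list(segments)
    for i in range(1, len(segments) - 1):
        prev, cur, nxt = segments[i - 1], segments[i], segments[i + 1]
        if (cur.symbol == "T"
                and prev.symbol in SONORANTS
                and nxt.symbol in SONORANTS
                and not nxt.stressed):
            out[i] = Segment("DX")
    return out


def symbols(segments):
    """Return just the symbol labels, for easy inspection."""
    return [s.symbol for s in segments]


# "city": medial /t/ before an unstressed vowel.
CITY = [Segment("S"), Segment("IH", True), Segment("T"), Segment("IY")]
# "bait a hook": word-final /t/ followed by an unstressed vowel in the next word.
BAIT_A_HOOK = [Segment("B"), Segment("EY", True), Segment("T"), Segment("AX"),
               Segment("HH"), Segment("UH", True), Segment("K")]
# "bait" in isolation: the /t/ is utterance-final, so the rule does not apply.
BAIT = [Segment("B"), Segment("EY", True), Segment("T")]
```

Running `symbols(apply_flapping(CITY))` gives `['S', 'IH', 'DX', 'IY']`, and the /t/ of "bait" is flapped in `BAIT_A_HOOK` but left unchanged in `BAIT`, which is why a lexicon entry alone cannot replace the rule.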

Part of the problem of speaking naturally concerns the phonetic form of function words. Words such as "for," "to," and "him" often take on reduced forms (Heffner, 1969), but not in all phonetic environments. For example, in Klattalk, "for" is not reduced if the next segment is a vowel or silence. If these words are never reduced, the speech sounds stilted (something like that of a bad actor trying to articulate carefully), while over-application of rules for reducing function words may lead to misperceptions as to the number of syllables in an utterance.
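The one context condition given above, for "for", can be written as a small predicate. This is a sketch only: the function name, the ARPAbet-like vowel set, and the use of `None` to stand for silence are assumptions, and the sketch does not encode the reduced forms themselves or the conditions for other function words, which the text does not state.

```python
# Hypothetical vowel inventory in ARPAbet-like notation (an assumption).
VOWELS = {"IH", "IY", "EY", "AX", "UH", "AE", "AA", "AO", "OW", "UW", "EH", "AH"}


def may_reduce_for(next_symbol):
    """Return True if "for" may be reduced before next_symbol.

    next_symbol is the label of the following segment, or None for silence.
    Per the condition described in the text, reduction is blocked before a
    vowel or before silence.
    """
    if next_symbol is None:  # silence follows
        return False
    return next_symbol not in VOWELS
```

Under these assumptions, `may_reduce_for("M")` is true (as in "for me"), while `may_reduce_for("AE")` and `may_reduce_for(None)` are both false.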

While a phoneme inventory for English can be specified with little debate, selection of an appropriate inventory of allophonic symbols involves many conflicting criteria and tradeoffs. The clearest cases are those where a phoneme is
 
