NMAH | Smithsonian Speech Synthesis History Project (dk

significant prosodic information, as summarized in Table I.

Segmental factors that can influence stress judgments include vowel reduction (Fry, 1965) and associated phonological recoding/simplification phenomena. Thus, for example, in the word "photograph," the second vowel is reduced to a short-duration mid schwa vowel, and the /t/ is flapped (compare with "photography").

1. Intensity rules

The intensity pattern of speech tends to set off individual syllables because vowels are usually more intense than consonants. Stressed syllables, which are perceived to be louder than unstressed syllables, may be more intense by a few dB, but intensity per se is not a very effective perceptual cue to stress (Fry, 1958), due in part to the confounding variations in syllable intensity associated with vowel height, f0, laryngeal state, and other factors.

In a formant synthesizer, as in speech, the intensity of a voiced sound automatically goes up in proportion to f0. Thus one can achieve a degree of stress-related intensity increase by rules that only manipulate f0. Experience suggests that including a specific rule to increase stressed vowel intensity produces artificially strong stressed vowels.

At a phrase level, it appears that syllables at the end of an utterance can become weaker in intensity, especially if unstressed. However, it is not clear that this is simply an effect of reduced source intensity; usually the glottal waveform becomes more breathy as well, with a strong fundamental component and weaker high-frequency harmonics (Bickley, 1982).

If prosody is to include these source modifications, as it probably should in order to account for natural changes to voice quality over utterances, then we will need new descriptors and new data to quantify the perceptually important effects. At the very least, a new prosodic dimension is required to characterize a continuum of voice qualities from breathy through normal to creaky (Ladefoged, 1973; Catford, 1977). Other possible dimensions might be related to the stability of the vibration pattern (susceptibility to aperiodicities).

2. Duration rules

Aspects of speech timing are specified and modified by information coming from many different representational levels during speech production. Psychological and semantic variables influence the average speaking rate and determine durational increments due to emphasis or contrastive stress. The syntactic structure of the sentence to be produced determines the locations of prosodic boundaries at which segments are longer in duration. The lexicon and/or stress rules determine which consonants and vowels of a word are stressed and hence longer in duration than unstressed and reduced vowels. The phonological component of the speaking process selects appropriate allophones for the abstract phonemes of lexical items, and executes a set of rules that modify the allophone durations according to phonetic context. These effects have been examined in review papers by Lehiste (1970) and by Klatt (1976a).

As an example of the kinds of rules needed to predict segment durations in sentences, consider the model proposed by Klatt (1979a). The model assumes that (1) each phonetic segment type has an inherent duration that is specified as one of its distinctive properties, (2) each rule tries to effect a percentage increase or decrease in the duration of the segment, but (3) segments cannot be compressed shorter than a certain minimum duration (Klatt, 1973b). The model is summarized by the formula:

$$
\mathrm{DUR} = \mathrm{MINDUR} + \frac{(\mathrm{INHDUR} - \mathrm{MINDUR}) \times \mathrm{PRCNT}}{100}, \tag{2}
$$

where INHDUR is the inherent duration of a segment in ms, MINDUR is the minimum duration of a segment if stressed, and PRCNT is the percentage shortening determined by applying rules described in Table II.
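As a rough illustration of Eq. (2), the following minimal Python sketch computes a segment duration from INHDUR, MINDUR, and PRCNT. The function names and the numerical values (160 ms, 60 ms, and the rule percentages) are hypothetical and chosen only for illustration. The helper that combines several rule percentages multiplicatively, starting from 100, is an assumption about how successive rules might interact; the text above states only that each rule tries to effect a percentage change.

```python
def segment_duration(inhdur_ms, mindur_ms, prcnt):
    """Eq. (2): DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100.

    Only the compressible part of the segment (INHDUR - MINDUR) is scaled,
    so PRCNT = 100 yields INHDUR and PRCNT = 0 yields MINDUR.
    """
    return mindur_ms + (inhdur_ms - mindur_ms) * prcnt / 100.0


def combine_percentages(rule_percentages):
    """ASSUMPTION (not stated in the text): each rule scales a running
    PRCNT multiplicatively, starting from 100 (no change)."""
    prcnt = 100.0
    for p in rule_percentages:
        prcnt = prcnt * p / 100.0
    return prcnt


if __name__ == "__main__":
    # Hypothetical vowel: inherent duration 160 ms, minimum duration 60 ms.
    inhdur, mindur = 160.0, 60.0

    # Two hypothetical shortening rules (60% and 85%).
    prcnt = combine_percentages([60.0, 85.0])
    print("PRCNT:", prcnt)
    print("DUR (ms):", segment_duration(inhdur, mindur, prcnt))
```

Because the percentage is applied only to the portion of the segment above MINDUR, repeated shortening approaches the minimum duration without ever going below it, which reflects assumption (3) of the model.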

Segmental duration is one of the cues that (1) helps distinguish between segments (e.g., short versus long , or short /z/ versus long /s/), (2) determines features of neighboring segments (e.g., the voicing feature of postvocalic obstruents is cued in part by vowel duration -- versus ), (3) distinguishes between stressed and unstressed syllables, (4) signals phrase and clause boundaries, and (5) helps indicate the presence or absence of emphasis. Perceptual disentanglement of these effects is difficult (Klatt, 1982b). In fact, one of the unsolved problems in the development of rule systems for speech timing is the size of the unit (segment, onset/rhyme, syllable, word) best employed to capture various timing phenomena.

Other durational rule systems exist for English (Mattingly, 1968; Barnwell, 1971; Coker et al., 1973; Umeda, 1975, 1977). The rules contained in these systems are similar (not surprisingly), but there are too many ways to describe interacting phenomena, so that, e.g., Gaitenby et al. (1972) and Coker et al. (1973) rely heavily on multiple stress levels conditioned by syntactic category (verbs have less stress than nouns) and conditioned by word frequency (common words and words that are repeated in a discourse are reduced in stress). Other authors postulate rules related to rhythm and isochronous principles (Lehiste, 1977). Neither of these
