KLATT 1987, p. 760
significant prosodic information, as summarized in Table I. Segmental factors that can influence stress judgments include vowel reduction (Fry, 1965) and associated phonological recoding/simplification phenomena. Thus, for example, in the word "photograph," the second vowel is reduced to a short-duration mid schwa vowel /ə/, and the /t/ is flapped (compare with "photography").

1. Intensity rules

The intensity pattern of speech tends to set off individual syllables because vowels are usually more intense than consonants. Stressed syllables, which are perceived to be louder than unstressed syllables, may be more intense by a few dB, but intensity per se is not a very effective perceptual cue to stress (Fry, 1958), due in part to the confounding variations in syllable intensity associated with vowel height, f0, laryngeal state, and other factors. In a formant synthesizer, as in speech, the intensity of a voiced sound automatically goes up in proportion to f0. Thus one can achieve a degree of stress-related intensity increase by rules that only manipulate f0. Experience suggests that including a specific rule to increase stressed vowel intensity produces artificially strong stressed vowels.

At a phrase level, it appears that syllables at the end of an utterance can become weaker in intensity, especially if unstressed. However, it is not clear that this is simply an effect of reduced source intensity; usually the glottal waveform becomes more breathy as well, with a strong fundamental component and weaker high-frequency harmonics (Bickley, 1982). If prosody is to include these source modifications, as it probably should in order to account for natural changes to voice quality over utterances, then we will need new descriptors and new data to quantify the perceptually important effects. At the very least, a new prosodic dimension is required to characterize a continuum of voice qualities from breathy through normal to creaky (Ladefoged, 1973; Catford, 1977). Other possible dimensions might be related to the stability of the vibration pattern (susceptibility to aperiodicities).

2. Duration rules

Aspects of speech timing are specified and modified by information coming from many different representational levels during speech production. Psychological and semantic variables influence the average speaking rate and determine durational increments due to emphasis or contrastive stress. The syntactic structure of the sentence to be produced determines the locations of prosodic boundaries at which segments are longer in duration. The lexicon and/or stress rules determine which consonants and vowels of a word are stressed and hence longer in duration than unstressed and reduced vowels. The phonological component of the speaking process selects appropriate allophones for the abstract phonemes of lexical items, and executes a set of rules that modify the allophone durations according to phonetic context. These effects have been examined in review papers by Lehiste (1970) and by Klatt (1976a).
As an example of the kinds of rules needed to predict segment
durations in sentences, consider the model proposed by Klatt (1979a).
The model assumes that (1) each phonetic segment type has an inherent
duration that is specified as one of its distinctive properties,⁶ (2)
each rule tries to effect a percentage increase or decrease in the
duration of the segment, but (3) segments cannot be compressed
shorter than a certain minimum duration (Klatt, 1973b). The model is
summarized by the formula:
$$\mathrm{DUR} = \mathrm{MINDUR} + \frac{(\mathrm{INHDUR} - \mathrm{MINDUR}) \times \mathrm{PRCNT}}{100}$$
where INHDUR is the inherent duration of a segment in ms, MINDUR is the minimum duration of a segment if stressed, and PRCNT is the percentage shortening determined by applying rules described in Table II.

Segmental duration is one of the cues that (1) helps distinguish between segments (e.g., short versus long vowels, or short /z/ versus long /s/), (2) determines features of neighboring segments (e.g., the voicing feature of postvocalic obstruents is cued in part by vowel duration), (3) distinguishes between stressed and unstressed syllables, (4) signals phrase and clause boundaries, and (5) helps indicate the presence or absence of emphasis. Perceptual disentanglement of these effects is difficult (Klatt, 1982b). In fact, one of the unsolved problems in the development of rule systems for speech timing is the size of the unit (segment, onset/rhyme, syllable, word) best employed to capture various timing phenomena.
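As an illustration of the duration model described in this section, the following minimal Python sketch assumes the formula takes the form DUR = MINDUR + (INHDUR - MINDUR) × PRCNT / 100. The function names, the numeric values, and the idea of combining several rule percentages by multiplication are assumptions made for this example; the actual rules and segment durations are those of Table II and are not reproduced here.

```python
def segment_duration(inhdur_ms, mindur_ms, prcnt):
    """Return DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100, in ms.

    Only the portion of the segment above its minimum duration is
    scaled, so shortening (PRCNT < 100) can never push the result
    below MINDUR as long as PRCNT stays non-negative.
    """
    if mindur_ms > inhdur_ms:
        raise ValueError("MINDUR must not exceed INHDUR")
    if prcnt < 0:
        raise ValueError("PRCNT must be non-negative")
    return mindur_ms + (inhdur_ms - mindur_ms) * prcnt / 100.0


def combine_rule_percentages(percentages):
    """Combine per-rule percentages into one PRCNT value.

    Assumption for this sketch: PRCNT starts at 100 and each rule
    scales it multiplicatively (e.g. 50 halves it, 140 lengthens it).
    """
    prcnt = 100.0
    for p in percentages:
        prcnt *= p / 100.0
    return prcnt


# Hypothetical segment: inherent duration 160 ms, minimum 60 ms.
INHDUR, MINDUR = 160.0, 60.0

unchanged = segment_duration(INHDUR, MINDUR, 100.0)  # no rule applies
halved = segment_duration(INHDUR, MINDUR, 50.0)      # one shortening rule
floor = segment_duration(INHDUR, MINDUR, 0.0)        # fully compressed
```

With these hypothetical values, a PRCNT of 100 leaves the segment at its inherent 160 ms, a PRCNT of 50 gives 110 ms rather than 80 ms (because only the 100 ms above the minimum is compressible), and a PRCNT of 0 yields the 60 ms floor. This is how the model captures the incompressibility assumption (3).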
Other durational rule systems exist for English (Mattingly, 1968;
Barnwell, 1971; Coker et al., 1973; Umeda, 1975, 1977). The rules
contained in these systems are, not surprisingly, similar, but
interacting phenomena can be described in many ways, so the
formulations differ. For example, Gaitenby et al. (1972) and
Coker et al. (1973) rely heavily on multiple stress levels
conditioned by syntactic category (verbs have less stress than nouns)
and by word frequency (common words and words that are repeated in a
discourse are reduced in
stress). Other authors postulate rules related to rhythm and isochronous
principles (Lehiste, 1977). Neither of these