NMAH | Smithsonian Speech Synthesis History Project (dk

replaced by an allophone with distinctly different articulatory/acoustic properties. For example, the phoneme / l / is realized as a velarized variant following a vowel, while there is normally no velar constriction for word-initial productions of / l / (Lehiste, 1962).

Less clear are those cases where a small change is the result of a low-level articulatory interaction (Schwartz, 1967), or where many small changes can be made along an articulatory/acoustic dimension such as voice onset time (VOT). For example, the time between release of a / t / and voicing onset is typically about 50 ms, but is systematically about 10 ms longer in a word-initial position, e.g., "tone," than it is in prestressed word-medial positions, e.g., "atone," and VOT is shorter if the following vowel is unstressed (Klatt, 1975b). Should one create a separate symbol for each gradation along the voice-onset-time continuum, or handle these effects as low-level adjustments to the time functions that control the synthesizer? Distinctions between allophone selection rules and parameter adjustment rules are necessarily arbitrary, and of relatively little theoretical import to us. 8 The important thing is to be able to produce the appropriate acoustic changes in the synthetic speech, and to do so in an efficient way.
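The parameter-adjustment alternative can be illustrated with a minimal Python sketch. It is not taken from any synthesizer described here: the function name and constants are hypothetical, the 50 ms baseline is assumed to apply to the prestressed word-medial case, and the shortening factor for an unstressed following vowel is an invented placeholder, since the text gives no figure for it.

```python
BASE_VOT_MS = 50.0            # typical /t/ VOT quoted in the text (assumed: prestressed, word-medial)
WORD_INITIAL_EXTRA_MS = 10.0  # word-initial lengthening quoted in the text
UNSTRESSED_SCALE = 0.7        # ASSUMPTION: illustrative factor, not from the source


def t_vot_ms(word_initial: bool, next_vowel_stressed: bool) -> float:
    """Return a voice onset time in ms for /t/, treated as a low-level
    adjustment to one segment rather than as a choice among allophone symbols."""
    vot = BASE_VOT_MS
    if word_initial:
        vot += WORD_INITIAL_EXTRA_MS   # e.g., "tone" versus "atone"
    if not next_vowel_stressed:
        vot *= UNSTRESSED_SCALE        # shorter VOT before an unstressed vowel
    return vot


if __name__ == "__main__":
    print(t_vot_ms(word_initial=True, next_vowel_stressed=True))
    print(t_vot_ms(word_initial=False, next_vowel_stressed=True))
```

A single numeric adjustment of this kind avoids multiplying symbols for each gradation of VOT, which is the trade-off the paragraph above raises.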

Some of the rules to be discussed below appear to be articulatory simplifications that allow the speaker to be "lazy" in realizing some unstressed phonetic sequences. While ease of pronunciation may play a role in the development of allophonic variation, a far more important function of these rules is to help mark boundaries, especially word boundaries, in the flow of speech (Lehiste, 1959; Nakatani and Dukes, 1977). Lehiste cites many examples in which allophones mark boundaries, the best known of which is the distinction between "night rate" and "nitrate," where a listener can easily tell which sequence the speaker intended because of the stronger frication/aspiration in the latter case. 9

Most of the rules discussed below are thus not strictly "sloppy speech" rules, and they are not optional rules. They are needed to make sentences sound fluent and natural. The rules help the listener decide the syllable affiliation of consonants and the degree of stress on a syllable, and thus indirectly constrain locations of potential word boundaries, permitting the listener to parse an utterance into words without pursuing too many alternative interpretations (Church, 1983). Phonotactics, or the specification of permitted phonetic sequences at the beginnings, middles, and ends of words, also can provide word boundary hypotheses for the listener (Lamel and Zue, 1984).

The details of phonological rule application differ for the different dialects of English, as well as for different speaking styles (formal/casual) and speaking rates within a given dialect. This is a serious problem for speech recognition devices (see, e.g., the rule compendium of Cohen and Mercer, 1974), but a text-to-speech system need only select rules appropriate for one acceptable dialect of English, and perhaps make some modifications concerning rule applicability as a function of speaking rate (Bernstein and Baldwin, 1985). In Klattalk, some phonetic simplifications across word boundaries are blocked if a phrase boundary is present. This mechanism is used to produce more formal speech at slow speaking rates simply by placing phrase boundary symbols at more minor phrase breaks when analyzing a text. In the future, it might be interesting to attempt to simulate additional dialects and styles by direct manipulation of phonological rules in these systems.
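The blocking mechanism can be sketched in Python under stated assumptions. Nothing below reproduces Klattalk's actual rules or notation: the boundary symbol, the phone labels, the small vowel set, and the choice of flapping as the cross-word simplification are all hypothetical, chosen only to show how a phrase boundary can prevent a rule from applying across two words.

```python
PHRASE_BOUNDARY = ")"  # ASSUMPTION: hypothetical marker, not Klattalk's symbol
VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "ow", "uw"}  # small illustrative set


def apply_cross_word_flapping(items):
    """Flap a word-final /t/ or /d/ before a vowel-initial word, unless a
    phrase boundary marker separates the two words.

    `items` is a list whose elements are either a word (a list of phone
    strings) or the PHRASE_BOUNDARY marker. The input is not modified.
    """
    out = [list(x) if isinstance(x, list) else x for x in items]
    for i in range(len(out) - 1):
        cur, nxt = out[i], out[i + 1]
        if not isinstance(cur, list) or not isinstance(nxt, list):
            continue  # a boundary marker is adjacent, so the rule is blocked
        if cur and nxt and cur[-1] in ("t", "d") and nxt[0] in VOWELS:
            cur[-1] = "dx"  # "dx" stands for the flap in this sketch
    return out


if __name__ == "__main__":
    fluent = [["g", "aa", "t"], ["ih", "t"]]
    formal = [["g", "aa", "t"], PHRASE_BOUNDARY, ["ih", "t"]]
    print(apply_cross_word_flapping(fluent))
    print(apply_cross_word_flapping(formal))
```

Inserting more boundary markers during text analysis then yields more careful, formal-sounding output without changing the rule itself.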

Some of the allophonic phenomena to be described have been known for a long time, many having appeared in phonetics textbooks at least as far back as the 1930s (Bloomfield, 1933; Hockett, 1955; Heffner, 1969). However, acoustic characterization had to await instrumental study. One of the first and best of the acoustic-phonetic studies was performed by Lehiste (1959, 1964). She noted the following kinds of word boundary indicators:

The presence of a laryngealized vowel onset usually signals the beginning of a word that starts with a vowel.
A normally aspirated release of [p,t,k] becomes unaspirated if a preceding [s] is part of the same word ("the spot" versus "this pot").
Selection between an initial and a final allophone of / r / or / l / intervocalically depends on the location of a word boundary on either side of the consonant.
A vowel is longer in duration in an open syllable (no word-final consonant), and shorter if followed by a voiceless word-final consonant.
A word-final [t,d] is flapped or glottalized before a word beginning with a stressed vowel.
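As one illustration of how such indicators might be expressed as an allophone selection rule, the following Python sketch implements only the aspiration case ("the spot" versus "this pot"). It is a simplification under stated assumptions: the phone labels, the vowel set, and the convention of appending "h" to mark aspiration are hypothetical, stress is ignored, and only prevocalic stops are considered.

```python
VOICELESS_STOPS = {"p", "t", "k"}
VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "ow", "uw"}  # small illustrative set


def mark_aspiration(words):
    """Mark prevocalic [p,t,k] as aspirated (suffix 'h') unless an /s/
    precedes the stop within the same word.

    `words` is a list of words, each a list of phone strings. The word
    boundary matters because the /s/ is looked up only inside the word.
    """
    result = []
    for word in words:
        marked = []
        for i, phone in enumerate(word):
            prevocalic = i + 1 < len(word) and word[i + 1] in VOWELS
            preceded_by_s = i > 0 and word[i - 1] == "s"
            if phone in VOICELESS_STOPS and prevocalic and not preceded_by_s:
                marked.append(phone + "h")  # aspirated release
            else:
                marked.append(phone)        # unchanged
        result.append(marked)
    return result


if __name__ == "__main__":
    the_spot = [["dh", "ah"], ["s", "p", "aa", "t"]]
    this_pot = [["dh", "ih", "s"], ["p", "aa", "t"]]
    print(mark_aspiration(the_spot))
    print(mark_aspiration(this_pot))
```

The two phrases contain the same phone sequence apart from the first vowel; only the position of the word boundary decides whether the / p / is aspirated, which is the cue a listener can exploit.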

Nakatani and O'Connor-Dukes (1979) extended this work, and concluded that the phonetic cues and stress changes are perceptually more powerful cues to word boundary locations than are durational and pitch changes associated with syntactic boundary movements. They used an analysis-resynthesis system to generate stimuli with, e.g., durational characteristics of one phrase and phonetic characteristics of another in order to obtain perceptual judgments of cue strength. Additional phenomena that they noted include:

Geminate consonants are lengthened with respect to singletons (e.g., the / k / in "drunk converse" versus "drunken verse").
Vowels can be deleted and words resyllabified (e.g., when "bakery" becomes a two-syllable word).
There are restrictions on vowel reduction (e.g., there is reduction in "hard defeat" but not in "hardy feet").

In a subsequent study of [ l ] and [ r ], Lehiste (1962) noted that the prevocalic "light" allophone of / l / as in "lead" has a second formant that depends on the following vowel, whereas the postvocalic "dark" or velarized / l / as in "deal" has a lower second formant that is independent of the preceding vowel and is similar to that of the syllabic / l / in "bottle." The initial allophone of / r / as in "reed" has lower F1, F2, and F3 than the postvocalic allophone as in "deer." The syllabic nucleus as in "dirt" has formant targets similar to those of the postvocalic allophone.

In a study of the allophones of / t,d / and their distribution, Zue and Laferriere (1979) distinguished:

within-word prestressed variants as in "return" and "reduce,"
