KLATT 1987, p. 738

text to fluent, intelligible, natural sounding speech. The hope is that this critical review will focus future research in directions having the greatest payoff. The reader should be aware that the author is not an impartial outside observer, but rather an active participant in the field who has many biases that will no doubt color the review.

The steps involved in converting text to speech are illustrated in Fig. 1 (Allen, 1976). First, a set of modules must analyze the text to determine the underlying structure of the sentence, and the phonemic composition of each word. Then, a second set of modules transforms this abstract linguistic representation into a speech waveform. These processes have interesting connections to linguistic theory, models of speech production, and the acoustic-phonetic characterization of language (experimental phonetics), as well as to a topic that Vanderslice (1968) calls "synthetic elocution," or the art of effective reading out loud. The review will focus on the conversion of English text to speech. Systems for other languages will not be reviewed unless they have contributed to the evolution of systems for English.
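The two-stage flow just described can be sketched in code. This is a minimal illustration only: the module names, data fields, and toy lexicon below are assumptions for exposition, not part of any actual system described in this review.

```python
# Sketch of the two-stage text-to-speech pipeline: (1) analyze text into an
# abstract linguistic representation, (2) transform that representation into
# a waveform. All names and the one-word lexicon are illustrative stand-ins.
from dataclasses import dataclass

@dataclass
class LinguisticRepresentation:
    """Abstract output of the text-analysis stage."""
    words: list      # surface word strings
    phonemes: list   # phoneme string for each word

def analyze_text(text: str) -> LinguisticRepresentation:
    """Stage 1: determine the phonemic composition of each word
    (here a toy dictionary lookup, with letters as a crude fallback)."""
    lexicon = {"beam": ["b", "i", "m"]}  # toy pronouncing dictionary
    words = text.lower().split()
    phonemes = [lexicon.get(w, list(w)) for w in words]
    return LinguisticRepresentation(words=words, phonemes=phonemes)

def synthesize(rep: LinguisticRepresentation) -> list:
    """Stage 2: turn the linguistic representation into a waveform.
    A real system computes prosody and drives a synthesizer; here we
    return the flattened phoneme sequence as a stand-in."""
    return [p for word in rep.phonemes for p in word]

print(synthesize(analyze_text("beam")))  # ['b', 'i', 'm']
```

The point of the two-stage split is that everything prosodic and articulatory happens downstream of an explicit linguistic representation, rather than directly on the text.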

It might seem more practical to store natural waveforms corresponding to each word of English, and to simply concatenate them to produce sentences, particularly considering the low cost and large capacity of new laser disk technology. However, such an approach is doomed to failure because a spoken sentence is very different from a sequence of words uttered in isolation. In a sentence, a word can be as short as half its duration when spoken in isolation, so concatenated speech sounds painfully slow. The sentence stress pattern, rhythm, and intonation, which depend on syntactic and semantic factors, are disruptively unnatural when words are simply strung together in a concatenation scheme. Finally, words blend together at an articulatory level in ways that are important to their perceived naturalness and intelligibility. The only satisfactory way to simulate these effects is to go through an intermediate syntactic, phonological, and phonetic transformation. 1

A second problem with approaches that attempt to store representations for whole words is that the number of words that can be encountered in free text is extremely large, due in part to the existence of an unbounded set of proper names [e.g., the Social Security Administration (1985) estimates that there are over 1.7 million different surnames in their files], as well as the existence of general rules that permit the formation of larger words by the addition of prefixes and suffixes to root words, or by compounding. Also, new words are being coined every day. It was hoped that a system employing prerecorded words might spell out such items for the listener, but this has proven to be less than satisfactory. Modern systems to be described below have fairly powerful fall-back procedures to be used when an unfamiliar word is encountered.
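One common shape for such a fall-back procedure is a cascade: direct lexicon lookup, then affix stripping to reach a known root, and only as a last resort letter-to-sound rules. The sketch below assumes this cascade; the lexicon, suffix table, and single-letter rules are toys for illustration, not any actual system's rule set.

```python
# Hedged sketch of a fall-back cascade for unfamiliar words.
LEXICON = {"form": ["f", "ao", "r", "m"]}       # known root words (toy)
SUFFIXES = {"ing": ["ih", "ng"], "s": ["z"]}    # affix pronunciations (toy)

def letter_to_sound(word):
    """Last-resort grapheme-to-phoneme: one toy rule per letter."""
    rules = {"f": "f", "o": "aa", "r": "r", "m": "m", "x": "k s"}
    return [p for ch in word for p in rules.get(ch, ch).split()]

def pronounce(word):
    if word in LEXICON:                          # 1. direct lookup
        return LEXICON[word]
    for suf, suf_phones in SUFFIXES.items():     # 2. strip affix, find root
        if word.endswith(suf) and word[:-len(suf)] in LEXICON:
            return LEXICON[word[:-len(suf)]] + suf_phones
    return letter_to_sound(word)                 # 3. letter-to-sound rules

print(pronounce("forming"))  # ['f', 'ao', 'r', 'm', 'ih', 'ng']
```

Each stage handles the failures of the one before it, which is why even a small lexicon can cover much of running text while the rules absorb names and coinages.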

For expository reasons, the review is organized backwards with respect to Fig. 1. Only after we have some idea of the nature of the input information required by the synthesis routines will the second section take up the analysis of text.

A. Linguistic framework

A recent trend in linguistics has been to describe a language such as English in generative terms, the goal being to specify rules for the generation of any legitimate sentence of the language (Chomsky and Halle, 1968). I have summarized and simplified this view somewhat in Fig. 2 to indicate how it might be applied to the problem of synthesis. Linguists believe that a sentence can be represented by a sequence of discrete elements, called phonemes, that are drawn from a small set of about 40 such sound building blocks for English (see Table IV). These abstract phonemic symbols might be thought to represent articulatory target configurations or gestures. Thus a word like "beam" consists of three phonemes, the /b/ characterized by lip closure, the vowel /i/ characterized by a high fronted tongue position, and the nasal /m/ characterized by both lip closure and opening of the velar port to the nasal passages. The psychological reality of the phoneme as a unit for representing how words are to be spoken is attested to by collections of speech errors in which phonemic exchanges are common (Fromkin, 1971). Linguists have also found it useful to be able to refer to the components or features of a phoneme, such as the fact that /b/ and /m/ are both + LABIAL, while only /m/ is + NASAL. Rules describing how words change pronunciation in certain sentence contexts are often stated most efficiently in terms of features.
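The feature notation can be made concrete with a short sketch. The feature names follow the text (+LABIAL, +NASAL), but the feature sets and the vowel-nasalization rule below are standard textbook illustrations of a feature-based rule, not drawn from any particular synthesis system.

```python
# Phonemes as bundles of distinctive features, and one rule stated over
# features rather than over individual phonemes. Feature sets are toys.
FEATURES = {
    "b": {"LABIAL"},
    "m": {"LABIAL", "NASAL"},
    "i": {"VOWEL", "HIGH", "FRONT"},
}

def has(phoneme, feature):
    return feature in FEATURES.get(phoneme, set())

def nasalize_vowels(phonemes):
    """Feature-based rule: a vowel acquires +NASAL before a +NASAL segment.
    Stated once over the feature NASAL, it covers /m/, /n/, etc. alike."""
    out = []
    for i, p in enumerate(phonemes):
        feats = set(FEATURES.get(p, set()))
        if "VOWEL" in feats and i + 1 < len(phonemes) and has(phonemes[i + 1], "NASAL"):
            feats.add("NASAL")
        out.append((p, feats))
    return out

# The /i/ of "beam" picks up +NASAL before /m/:
result = nasalize_vowels(["b", "i", "m"])
```

This is why rules are stated most efficiently in features: one statement applies to every phoneme sharing the relevant feature, instead of being repeated for each phoneme.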

Phoneme strings form larger units such as syllables, words, phrases, and clauses. These structures should be indicated in the underlying representation for an utterance, because aspects of how a sentence is pronounced depend on the