NMAH | Smithsonian Speech Synthesis History Project (dk

suggested by statistics indicating that over 90% of bisyllabic nouns have stress on the first syllable, while only about 15% of bisyllabic verbs are stressed on the first syllable (Francis and Kucera, 1982).

One issue faced by designers of systems is which to do first, stress prediction or phoneme prediction. Another issue is whether to essentially work forward or backward through the letter string for a word. While no system has to go only left to right, or completely settle stress prediction prior to phonemic analysis, there seem to be clear advantages to working backwards through the letter string, and to having stress information prior to making vowel decisions (Bernstein and Nessly, 1981).

2. Exceptions to the rules

When evaluating a set of letter-to-phoneme rules, it is easy to make up lists of words that fail to be pronounced properly. Systematic comparison of the rules against a list of frequent words can produce a dictionary of exceptions that, if added to the system, will make overall pronunciation performance much better than for a system that only uses rules. The utility of a small exceptions dictionary can be appreciated by observing the ability of a small number of most frequent words to account for a given fraction of words in running text (Hunnicutt, 1980). The data are reproduced in Fig. 32. They indicate that a small number of words, about 200, are required to cover half the words occurring in a random text. With a dictionary of 2000 words, over 70% of the words in text will be matched and not have to go through letter-to-sound rules. However, the law of diminishing returns begins to take over shortly after this point -- if one extrapolates from the slope of the curve prior to 10 000 words, as indicated by the dashed line in Fig. 32, it appears that to go from 90% to 93% coverage would require about an additional 60 000 words!
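The coverage figures above come from ranking words by frequency and accumulating their share of running text. A minimal sketch of that calculation follows; the function name `coverage_curve` and the toy sentence are illustrative assumptions, not Hunnicutt's data or method.

```python
from collections import Counter


def coverage_curve(tokens, sizes):
    """Return, for each dictionary size N, the fraction of running-text
    tokens matched by the N most frequent word types.

    Illustrative only: the name and interface are assumptions made here.
    """
    counts = Counter(word.lower() for word in tokens)
    # Token counts of word types, most frequent first.
    ranked = [count for _, count in counts.most_common()]
    total = sum(ranked)
    return {n: sum(ranked[:n]) / total for n in sizes}


# Toy text (an assumption, far too small to reproduce Fig. 32).
text = "the cat sat on the mat and the dog sat on the log".split()
curve = coverage_curve(text, [1, 2, 4])
for size, fraction in curve.items():
    print(f"top {size} word types cover {fraction:.0%} of tokens")
```

Run over a large corpus with sizes such as 200, 2000, and 10 000, the same routine would trace a curve of the kind described for Fig. 32, with the flattening tail that makes very large dictionaries expensive per point of coverage gained.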

Elovitz et al. (1976) and Hertz (1982) embed lists of exceptions inside the letter-to-sound rules of their systems (such as the observation that the letter "f" is pronounced with a voiceless /f/ phoneme in all words except "of") so as to ensure getting common words correct, whereas others tend to segregate out exceptions as a separate dictionary. The best performance for a rule system without an exceptions dictionary, better than 85% correct when tested on a random sample from a large dictionary, has been obtained by the Bernstein rules that are a part of the Speech Plus, Inc. Prose-2000 (Groner et al., 1982). Bernstein argues that it is possible to design a letter-to-sound algorithm with a very simple structure -- consisting of one right-to-left pass through the letters, starting inside all stress-neutral suffixes.
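The separate-dictionary arrangement can be sketched as a lookup that falls back to rules only when the word is not listed. Everything in the block below is a hypothetical toy: the phoneme symbols, the one-letter "rules," and the single dictionary entry are assumptions for illustration and are not taken from any of the systems cited.

```python
# Hypothetical exceptions dictionary: whole-word pronunciations.
EXCEPTIONS = {"of": ["AH", "V"]}

# Hypothetical context-free letter rules; real rule sets are context sensitive.
LETTER_RULES = {"o": "AA", "f": "F", "i": "IH", "t": "T", "n": "N"}


def letter_to_sound(word):
    """Apply the toy one-letter rules, skipping letters with no rule."""
    return [LETTER_RULES[ch] for ch in word if ch in LETTER_RULES]


def pronounce(word):
    """Consult the exceptions dictionary first, then fall back to rules."""
    key = word.lower()
    if key in EXCEPTIONS:
        return list(EXCEPTIONS[key])
    return letter_to_sound(key)


print(pronounce("of"))        # dictionary hit: voiced pronunciation of "f"
print(pronounce("fit"))       # rule fallback: voiceless "f"
print(letter_to_sound("of"))  # what the rules alone would produce for "of"
```

Embedding the exception inside the rules instead would mean adding a more specific rule for "f" in the word "of" ahead of the general "f" rule, which keeps a single mechanism at the cost of mixing word-specific facts into the rule set.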

A moderate-sized exceptions dictionary can hide the deficiencies of a weak set of letter-to-sound rules, but at a high cost in terms of storage requirements. Based on data shown in Fig. 32, Hunnicutt (1980) showed that the size of an exceptions dictionary required to get a target fraction of input words pronounced correctly in a typical running text is a strong function of letter-to-sound rule performance. For example, the 3000-word exceptions dictionary in the Speech Plus Prose-2000, coupled with rules that are correct 85% of the time, results in an overall system performance of better than 97% correct (only 1 word in 33 in a typical text contains a noticeable phoneme or stress error). On the other hand, the first version of DECtalk, employing the Hunnicutt (1980) rules with 65% accuracy and a larger 6000-word exceptions dictionary, barely reached 95% correct (1 error every 20 words). Independent confirmation of this accuracy comparison comes from Huggins et al. (1986), who examined over 1600 low-frequency polysyllabic words and found phonemic mispronunciations in 8.3% for the new Speech Plus Calltext system, compared with 12.9% errors for Version 1 of DECtalk. The current DECtalk, Version 3.0, uses a new letter-to-sound rule system (Conroy and Vitale, 1986) to achieve performance of fewer than 6% errors for this data set, according to my evaluation.
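The dependence of dictionary size on rule quality can be illustrated with a deliberately simplified model: assume every word found in the exceptions dictionary is pronounced correctly, and every other word is pronounced correctly with a fixed probability equal to the rule accuracy. Both assumptions are introduced here for illustration (the rule accuracies quoted above were measured on dictionary samples, not on the out-of-dictionary words of running text), and the function names are hypothetical.

```python
def overall_accuracy(coverage, rule_accuracy):
    """Fraction of text words pronounced correctly under the simplified model:
    dictionary hits are always right, the rest depend on the rules."""
    return coverage + (1.0 - coverage) * rule_accuracy


def required_coverage(target, rule_accuracy):
    """Dictionary coverage of running text needed to reach a target accuracy."""
    if not 0.0 <= rule_accuracy < 1.0:
        raise ValueError("rule_accuracy must be in [0, 1)")
    return max(0.0, (target - rule_accuracy) / (1.0 - rule_accuracy))


# Figures quoted in the text, used only as inputs to the simplified model.
for rules, target in [(0.85, 0.97), (0.65, 0.95)]:
    needed = required_coverage(target, rules)
    print(f"rules {rules:.0%} correct, target {target:.0%}: "
          f"dictionary must cover about {needed:.0%} of running text")
```

Under these assumptions, rules that are 85% correct need about 80% coverage to reach 97% overall, whereas rules that are 65% correct need about 86% coverage just to reach 95%. Because coverage grows slowly with dictionary size, that extra coverage would translate into a considerably larger dictionary.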

In the future, it is expected that morpheme-based algorithms (see below) will replace exceptions dictionaries in commercial systems because the cost of memory is such that the added performance is well worth the expense. Similarly, special algorithms for pronunciation of names are likely to be incorporated in commercial systems in the near future. Special purpose vocabularies, such as a dictionary of medical terms, will probably also become available in response to market pressures.

3. Morphemic decomposition

Problems with the pronunciation of compounds such as the "th" in "hothouse" and the silent "e" in "houseboat" led Lee (1969) to attempt to break each word into morphemes, the minimal meaningful units of language (see, e.g., Bloomfield, 1933, Chaps. 10, 13-14). Using a dictionary of about 3000 morphemes, Lee was able to split a word such as "houseboats" into "house" plus "boat" plus the plural "-s," and to retrieve from storage or predict the pronunciation of each piece. Lee developed techniques for recovering the proper base form after an affix was removed. The three most common problems, which could be handled correctly most
