
of the time using morphological decomposition, involved situations in which the surface form did not contain a silent "e" (choking -> choke + ing), there had been consonant doubling (omitted -> omit + ed), or a final "y" had been modified (cities -> city + s). Jonathan Allen and Deborah Finkel extended these techniques by increasing the morpheme dictionary to 12 000 items (Allen et al., 1979; Allen et al., 1987). Morphemes were selected by interactive examination of the approximately 50 000 unique words in the Brown corpus, a sampling of one million words of text (Kucera and Francis, 1967).13
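The spelling-change recovery involved in such decompositions is mechanical enough to sketch in code. The following Python fragment is a minimal illustration, not MITalk's actual algorithm; the tiny lexicon and the handful of suffixes are invented for the example:

    # A minimal sketch (not MITalk's algorithm) of undoing the three spelling
    # changes cited above when stripping an inflectional suffix.
    def strip_suffix(word, lexicon):
        """Return (stem, suffix) if a known stem is recovered, else None."""
        candidates = []
        if word.endswith("ing"):
            stem = word[:-3]
            candidates += [(stem, "ing"),        # talk+ing: no spelling change
                           (stem + "e", "ing")]  # chok(e)+ing: restore silent "e"
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                candidates.append((stem[:-1], "ing"))  # runn+ing: undo doubling
        if word.endswith("ed"):
            stem = word[:-2]
            candidates += [(stem, "ed"), (stem + "e", "ed")]
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                candidates.append((stem[:-1], "ed"))   # omitt+ed: undo doubling
        if word.endswith("ies"):
            candidates.append((word[:-3] + "y", "s"))  # cities: restore final "y"
        elif word.endswith("s"):
            candidates.append((word[:-1], "s"))
        for stem, suffix in candidates:
            if stem in lexicon:
                return stem, suffix
        return None

    print(strip_suffix("choking", {"choke", "omit", "city"}))  # ('choke', 'ing')
    print(strip_suffix("omitted", {"choke", "omit", "city"}))  # ('omit', 'ed')
    print(strip_suffix("cities",  {"choke", "omit", "city"}))  # ('city', 's')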

Allen et al. (1979) also developed rules for handling cases where a word has multiple parses (e.g., "scarcity" = "scarce + ity" or "scar + city"). One rule, illustrated by this example, is that affixing is more likely than compounding. None of these guidelines is absolute, so in comparing two alternative morphemic decompositions, the authors invoked a set of heuristic scoring procedures whereby a given morphemic division incurs a scoring penalty depending on what has happened so far. This scoring algorithm picks the correct decomposition for "formally" from among the set (form+all+y, for+mall+y, form+ally, form+al+ly). If, after all of this computation, the word was found to be an exception to the parsing heuristics (e.g., "been" not pronounced as "be" + "-en"), the whole word was added to the morpheme lexicon in unparsed form.14  An alternative method for dealing with inflectional suffixes, derivational affixes, and compounding is discussed in Church (1985, p. 251).
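The flavor of this penalty scoring can be conveyed with a toy Python example. The penalty values below are invented for illustration; the actual heuristics of Allen et al. were considerably richer:

    # Hedged sketch of penalty-based scoring over candidate decompositions.
    # An extra root (compounding) costs more than an affix, so affixing wins.
    PENALTY = {"prefix": 2, "suffix": 1, "root": 5}  # illustrative values only

    def score(parse):
        """Lower is better."""
        return sum(PENALTY[kind] for _, kind in parse)

    parses = [
        [("scarce", "root"), ("ity", "suffix")],  # affixing: 5 + 1 = 6
        [("scar", "root"), ("city", "root")],     # compounding: 5 + 5 = 10
    ]
    best = min(parses, key=score)
    print([morpheme for morpheme, _ in best])     # ['scarce', 'ity']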

Some morphemes are pronounced differently depending on the stress pattern of the word and the nature of the other morphemes present (note the second "o" of "photo" is realized phonemically as /oʊ/, /ə/, and /ɑ/ in "photo," "photograph," and "photography," respectively). The MITalk group developed rules to handle some of these cases, and simply added whole multimorphemic words to the lexicon if the rule was too complex or insufficiently productive. The morpheme decomposition algorithm is able to parse about 98% of the words in a typical text, and should be more accurate than letter-to-phoneme rules. The exact accuracy of the MITalk morpheme decomposition algorithm was never measured, although a cursory examination of a three-paragraph text (Allen et al., 1987, pp. 89-92) reveals a few (easily correctable) errors and a words-correct rate of only about 95%.
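One simple way to represent such alternations is to store allomorphs of a root keyed by the suffix that follows it. The sketch below uses CMUdict-style ARPAbet notation with stress digits; the data structure is an illustration, not the MITalk rule format:

    # Illustrative only: allomorphs of "photo-" selected by the following
    # suffix, since the suffix determines where main stress falls.
    PHOTO_ALLOMORPHS = {
        None:     "F OW1 T OW2",   # "photo": second "o" keeps its full vowel
        "graph":  "F OW1 T AH0",   # "photograph": second "o" reduces to schwa
        "graphy": "F AH0 T AA1",   # "photography": stress shifts rightward
    }

    def root_pronunciation(suffix=None):
        """Pick the allomorph of "photo-" appropriate to the given suffix."""
        return PHOTO_ALLOMORPHS.get(suffix, PHOTO_ALLOMORPHS[None])

    print(root_pronunciation("graph"))   # F OW1 T AH0
    print(root_pronunciation("graphy"))  # F AH0 T AA1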

One of the advantages of a morpheme lexicon, aside from an ability to divide compound words properly, is that a set of 12 000 morphemes can represent well over 100 000 English words. Thus a very large vocabulary is achieved at moderate storage cost. However, the greatest advantage of the morpheme lexicon may turn out to be its ability to supply part-of-speech information to a syntactic analyzer in order to improve the prosody of sentences (see below).

Recent work at Bell Laboratories (Coker, 1985) has extended this approach by enlarging the morpheme lexicon to 43 000 morphemes and by adding rules for suffix and prefix analysis and for stress reassignment with the stress-shifting suffixes. The algorithm and morpheme lexicon occupy about 900 kbytes on a developmental real-time text-to-speech board (Olive and Liberman, 1985).

4. Proper names

Proper names are a special problem because the rules for their pronunciation often depend on which language is assumed as the underlying origin of the spelling (Liberman, 1979). The commercial system that performs best at pronouncing proper names, the newest Speech Plus Calltext board, still has an error rate of about 20% in its rule component when confronted with random proper names (Wright et al., 1986). Church (1985) has recently proposed a solution to this problem that involves statistics on the frequency of occurrence of three-letter sequences in each of several languages. The first step is to use these statistics to estimate the language family of the unknown word. For words of moderate length, he finds that frequently one or another letter triple in the word essentially rules out all but the correct language. The second step is to apply stress and letter-to-phoneme rules for the language in question. Performance is claimed to be far superior to that of any system restricted to a single set of rules for all proper names. The importance of doing proper names by rule is brought out by statistical analyses showing that large name dictionaries do not solve the problem. An exceptions dictionary containing 2000 proper names will cover about 50% of the names in a random telephone directory, and 6000 proper names will cover about 60%. However, adding to the exceptions dictionary beyond 6000 names is essentially fruitless in that one is unable to get beyond an asymptote of about 62% of the names in one telephone directory, no matter how many names are obtained from another directory (Church, 1985).
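Church's trigram approach is easy to sketch. In the Python fragment below, the training name lists and the smoothing constant are toy stand-ins, not Church's actual data or statistics:

    import math
    from collections import Counter

    def trigrams(name):
        w = "##" + name.lower() + "#"   # pad so word-initial/final letters count
        return [w[i:i+3] for i in range(len(w) - 2)]

    def train(names):
        """Build a smoothed log-probability function over letter trigrams."""
        counts = Counter(t for n in names for t in trigrams(n))
        total = sum(counts.values())
        # add-one smoothing: unseen trigrams penalize rather than zero out
        return lambda t: math.log((counts[t] + 1) / (total + 26 ** 3))

    models = {  # toy training lists, for illustration only
        "English": train(["wright", "church", "smith", "brown"]),
        "Italian": train(["difilippo", "russo", "esposito", "ricci"]),
    }

    def guess_language(name):
        return max(models, key=lambda lang: sum(models[lang](t)
                                                for t in trigrams(name)))

    print(guess_language("rizzo"))  # "Italian" under these toy counts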

C. Syntactic analysis

Imposition of an appropriate prosodic contour on a sentence requires at least a partial syntactic analysis. Furthermore, some pronunciation ambiguities can be resolved from syntactic information. For example, there are more than 50 noun/verb ambiguous words such as "permit" that are pronounced with stress on the first syllable if a noun, and with stress on the second syllable if a verb (see Appendix D in Conroy et al., 1986). The only way to pronounce these words correctly is to figure out the syntactic structure of an input sentence, including the location of the verbs. Proper phrasing of moderately long clauses also requires knowledge of the locations of phrase boundaries. Thus it would be highly desirable to include a parser in a text-to-speech system.
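Once a tagger has committed to a part of speech for such a word, choosing the stress pattern reduces to a lookup. A minimal sketch follows; the dictionary entries and the informal stress notation are illustrative, not drawn from any of the cited systems:

    # Noun/verb homographs: stress falls on the first syllable as a noun,
    # on the second as a verb. Entries here are illustrative.
    HOMOGRAPHS = {
        "permit":  ("PERmit",  "perMIT"),
        "record":  ("RECord",  "reCORD"),
        "present": ("PRESent", "preSENT"),
    }

    def pronounce(word, pos):
        noun_form, verb_form = HOMOGRAPHS[word]
        return noun_form if pos == "NOUN" else verb_form

    print(pronounce("permit", "NOUN"))  # PERmit
    print(pronounce("permit", "VERB"))  # perMIT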

While powerful parsing strategies exist (see, e.g., Woods, 1970; Aho and Ullman, 1972; Marcus, 1980; Kaplan and Bresnan, 1982), they tend to produce many alternative parses, even for sentences that seem simple and unambiguous. For example, "Time flies like an arrow" is multiply ambiguous at a syntactic level; a syntactic analysis system would require an immense store of world knowledge (semantics/pragmatics) to behave as we do and focus immediately on the only sensible structural interpretation of the sentence. Allen (1976) foresaw this problem and restricted himself to the goal of selecting the most probable local phrase parse of an arbitrary English sentence. Using the morpheme decomposition algorithm just described, he and Calvin Drake were able to obtain reasonably accurate
 
