KLATT 1987, p. 771
85% correct at a word level (all phonemes correct and stress pattern correct) in a random sampling of a very large dictionary, which implies a phoneme correct rate of better than 97%. 10

NETtalk is related to a letter-pattern learning program described earlier by Lucassen and Mercer (1984). They defined a set of features corresponding to "random" sets of letters, used the forward-backward algorithm of the IBM speech recognition strategy (analogous to incremental training) on a 50 000 word lexicon to find the best feature sets for predicting individual phonemes, and established a set of probabilities (analogous to weights) for a search-tree recognition model, based again on a seven-letter input window. They obtained correct letter-to-phoneme correspondences for 94% of the letters in words in a random sample from a 5000 word office-correspondence lexicon. In terms of error rate, this is slightly better than NETtalk, especially considering that some fraction of the test words was probably not in the training set, but the Lucassen and Mercer approach still results in an inferior words-correct error rate compared with traditional rule systems. Even a very powerful statistical package cannot yet discover much of the underlying structure in a process as complex as natural language.

A proposal in the psychological literature related to these pattern-learning programs is that readers learn the letter-to-phoneme conversion rules not as explicit rules, but by analogy with similar local letter patterns in words that they already know how to pronounce (Glushko, 1981; Dedina and Nusbaum, 1986). For example, a novel word might be compared with all words in the lexicon, and the word sharing the largest number of letters with the unknown word would get to determine the pronunciation of that local substring. Glushko showed that subjects were slower to pronounce pseudowords that would have two equally likely alternative pronunciations if this strategy were followed.
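The core of the analogy strategy just described can be sketched in a few lines: compare the novel word against each known word and let the word sharing the longest substring supply the pronunciation of that substring. The toy lexicon below is purely illustrative (the cited systems used lexicons of tens of thousands of words), and the selection rule is the simplest possible variant, ignoring the word-frequency weighting of Dedina and Nusbaum (1986).

```python
def longest_common_substring(a, b):
    """Return the longest substring shared by a and b (simple O(n*m) DP)."""
    best = ""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > len(best):
                    best = a[i - table[i][j]:i]
    return best

def best_analogy(novel, lexicon):
    """Pick the lexicon word sharing the longest substring with `novel`."""
    return max(lexicon, key=lambda w: len(longest_common_substring(novel, w)))

# Hypothetical toy lexicon; "deat" shares "eat" with both "heat" and
# "great" but "ea" with "head" and "bead".
lexicon = ["heat", "head", "bead", "great"]
print(best_analogy("deat", lexicon))  # prints "heat"
```

Note that the pseudoword "deat" is exactly the kind of ambiguous case Glushko studied: the analogy with "heat/great" and the analogy with "head/bead" pull toward different vowel pronunciations, and this simple maximizer breaks the tie arbitrarily by list order.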
A computer implementation of a slightly more complicated version of this strategy (taking into account frequency of occurrence of analogous words) agreed with one of the pronunciations furnished by human subjects 91% of the time when tested on 70 simple pseudowords (Dedina and Nusbaum, 1986), while DECtalk pronunciations agreed with the response from at least one of seven human subjects 97% of the time.

Klatt and Shipman (1982) defined a way in which the substring comparison strategy might be performed optimally and rapidly, one letter at a time, by creating a moderate-sized decision tree. They examined the performance when a 20 000 word phonemic dictionary was divided in half randomly such that the first half was used to create the tree and the second half used to test it. The error rate for individual letters was 7%, which is not bad considering that test and training data were different, but this performance is still not nearly good enough to compete with conventional rule systems. Consonantal letters were found to be quite regular and amenable to translation with low error rates by this approach. However, the five vowel letters and "Y" accounted for four-fifths of the errors.

In summary, given the attention that NETtalk and other neuron-like devices have received recently, it is disturbing that NETtalk does not learn training-set data perfectly, appears to make generalizations suboptimally, and has an overall performance that is not acceptable for a practical system. Furthermore, it is unlikely that larger training lexicons would converge to a more acceptable performance.
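The letter-at-a-time idea can be illustrated with a minimal sketch: each letter is transcribed by looking up progressively narrower contexts (letter plus neighbours, then letter alone) until a context seen in training is found, approximating one path through such a decision tree. The aligned training pairs below are hypothetical toy data, not the dictionary used by Klatt and Shipman (1982).

```python
from collections import Counter, defaultdict

def train(aligned_words):
    """aligned_words: list of (spelling, per-letter phoneme list) pairs."""
    counts = defaultdict(Counter)
    for word, phones in aligned_words:
        padded = "#" + word + "#"   # word-boundary marker
        for i, phone in enumerate(phones, start=1):
            # Store each letter under three context widths, widest first.
            for ctx in (padded[i-1:i+2], padded[i] + padded[i+1], padded[i]):
                counts[ctx][phone] += 1
    return counts

def transcribe(word, counts):
    padded = "#" + word + "#"
    out = []
    for i in range(1, len(word) + 1):
        # Back off from the widest context to the bare letter.
        for ctx in (padded[i-1:i+2], padded[i] + padded[i+1], padded[i]):
            if ctx in counts:
                out.append(counts[ctx].most_common(1)[0][0])
                break
        else:
            out.append("?")   # letter never seen in training
    return out

# Toy alignment: one phoneme per letter, "_" marking a silent letter.
training = [("cat", ["k", "ae", "t"]), ("cake", ["k", "ey", "k", "_"])]
model = train(training)
print(transcribe("kate", model))  # prints ['k', 'ae', 't', '_']
```

Fittingly, even this tiny example gets the consonants and the silent "e" of the unseen word "kate" right but picks the wrong vowel ("ae" by analogy with "cat" rather than "ey"), echoing the finding that vowel letters account for most of the errors.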
Aside from limitations imposed by the network model, problems inherent in all these approaches are (1) the considerable extent of letter context that can influence stress patterns in a long word (and hence affect vowel quality in words like "photograph/photography"), (2) the confusion caused by some letter pairs, like CH, which function as a single letter in a deep sense, and thus misalign any relevant letters occurring further from the vowel, and (3) the difficulty of dealing with compound words (such as "houseboat" with its silent "e"), i.e., compounds act as if a space were hidden between two of the letters inside the word. The necessity of morphemic analysis is supported by data indicating that good spellers look for morphemes inside letter strings (Fischer et al., 1985), whereas to date these learning models seek regularities in letter patterns without recourse to a lexicon of any sort. On the other hand, efforts to find clear psychological evidence for morphological analysis of complex related forms (as opposed to rote learning of each) for word pairs such as "heal/health," "original/originality," "magic/magician," and "sign/signal" have generally failed (Carlisle, 1985).

1. Prediction of lexical stress from orthography
The Hunnicutt (1976) rule system included the improved version of
Chomsky-Halle stress rules (Halle and Keyser, 1971) consisting of
eight general rules, the most well-known of which are the main and
alternating stress rules for predicting which syllable receives
primary stress as a function of the "strong/weak" syllable pattern of
the word. Also included were rules for decomposing words by stripping
off affixes to recover the root. About 15 different prefixes and 50
suffixes were detected. Grammatical constraints were invoked to
prevent incompatible suffix sequences from being removed. Orthographic
features permitted rules to refer to concepts such as "true consonant"
and "vowel-like letter." Although it was the best-performing algorithm
of its time, the system was completely correct for only about 65% of a
random selection of words (Hunnicutt, 1980). 11
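The affix-stripping step of such a decomposition can be sketched as a simple recursive search: strip any known prefix or suffix and recurse until a listed root is exposed. The prefix, suffix, and root lists below are tiny illustrative stand-ins for the roughly 15 prefixes and 50 suffixes of the Hunnicutt system, and the grammatical constraints that block incompatible suffix sequences are omitted.

```python
# Illustrative affix and root inventories (hypothetical, not Hunnicutt's).
PREFIXES = ["un", "re", "pre"]
SUFFIXES = ["ness", "ly", "ing", "ful"]
ROOTS = {"happy", "happi", "hope", "do"}

def decompose(word):
    """Return (prefixes, root, suffixes), or None if no root is found."""
    if word in ROOTS:
        return ([], word, [])
    for p in PREFIXES:
        if word.startswith(p):
            parts = decompose(word[len(p):])
            if parts:
                return ([p] + parts[0], parts[1], parts[2])
    for s in SUFFIXES:
        if word.endswith(s):
            parts = decompose(word[:-len(s)])
            if parts:
                return (parts[0], parts[1], parts[2] + [s])
    return None

print(decompose("unhappiness"))  # prints (['un'], 'happi', ['ness'])
print(decompose("redoing"))     # prints (['re'], 'do', ['ing'])
```

With the root recovered, stress rules of the Chomsky-Halle type can then apply to the root plus stress-affecting suffixes rather than to the raw letter string.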
A good fraction of the errors
made by this letter-to-phoneme system were stress errors. In fact,
Bernstein and Nessly (1981) showed that a much simpler set of stress
rules described by Hill and Nessly (1973) performed about as well as
the Chomsky-Halle implementation. More recent high-performance
letter-to-phoneme rule systems (Bernstein and Nessly, 1981; Hunnicutt,
1980; Hertz, 1982; Carlson et al., 1982a; Church, 1985; Conroy and
Vitale, 1986) include improved attempts at morphemic decomposition
and stress prediction. Stress assignment is perhaps the weakest link
in all systems because an incorrect stress pattern, while perceptually
disruptive in and of itself, usually also triggers mis-selection of
vowel qualities. The newer systems not only base stress assignment on
factors such as morphological structure and the distinction between
strong and weak syllables (Chomsky and Halle, 1968), but also on
presumed part of speech, and in some cases, etymology (for a good
review, see Church, 1985). The importance of syntactic categorization
is