KLATT 1987, p. 771
85% correct at a word level (all phonemes correct and stress pattern correct) in a random sampling of a very large dictionary, which implies a phoneme correct rate of better than 97%. 10

NETtalk is related to a letter-pattern learning program described earlier by Lucassen and Mercer (1984). They defined a set of features corresponding to "random" sets of letters, used the forward-backward algorithm of the IBM speech recognition strategy (analogous to incremental training) on a 50 000 word lexicon to find the best feature sets for predicting individual phonemes, and established a set of probabilities (analogous to weights) for a search-tree recognition model, based again on a seven-letter input window. They obtained correct letter-to-phoneme correspondences for 94% of the letters in words in a random sample from a 5000 word office-correspondence lexicon. In terms of error rate, this is slightly better than NETtalk, especially considering that some fraction of the test words was probably not in the training set, but the Lucassen and Mercer approach still results in an inferior words-correct error rate compared with traditional rule systems. Even a very powerful statistical package cannot yet discover much of the underlying structure in a process as complex as natural language.

A proposal in the psychological literature related to these pattern-learning programs is that readers learn the letter-to-phoneme conversion rules not as explicit rules, but by analogy with similar local letter patterns in words that they already know how to pronounce (Glushko, 1981; Dedina and Nusbaum, 1986). For example, a novel word might be compared with all words in the lexicon, and the word sharing the largest number of letters with the unknown word would get to determine the pronunciation of that local substring. Glushko showed that subjects were slower to pronounce pseudowords that would have two equally likely alternative pronunciations if this strategy were followed.
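The core of the analogy strategy just described can be sketched in a few lines: compare the novel word against each known word and let the word sharing the longest substring supply the pronunciation of that substring. The toy lexicon below is purely illustrative (the cited systems used lexicons of tens of thousands of words), and the selection rule is the simplest possible variant, ignoring the word-frequency weighting of Dedina and Nusbaum (1986).

```python
def longest_common_substring(a, b):
    """Return the longest substring shared by a and b (simple O(n*m) DP)."""
    best = ""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > len(best):
                    best = a[i - table[i][j]:i]
    return best

def best_analogy(novel, lexicon):
    """Pick the lexicon word sharing the longest substring with `novel`."""
    return max(lexicon, key=lambda w: len(longest_common_substring(novel, w)))

# Hypothetical toy lexicon; "deat" shares "eat" with both "heat" and
# "great" but "ea" with "head" and "bead".
lexicon = ["heat", "head", "bead", "great"]
print(best_analogy("deat", lexicon))  # prints "heat"
```

Note that the pseudoword "deat" is exactly the kind of ambiguous case Glushko studied: the analogy with "heat/great" and the analogy with "head/bead" pull toward different vowel pronunciations, and this simple maximizer breaks the tie arbitrarily by list order.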
A computer implementation of a slightly more complicated version of this strategy (taking into account frequency of occurrence of analogous words) agreed with one of the pronunciations furnished by human subjects 91% of the time when tested on 70 simple pseudowords (Dedina and Nusbaum, 1986), while DECtalk pronunciations agreed with the response from at least one of seven human subjects 97% of the time.

Klatt and Shipman (1982) defined a way in which the substring comparison strategy might be performed optimally and rapidly, one letter at a time, by creating a moderate-sized decision tree. They examined the performance when a 20 000 word phonemic dictionary was divided in half randomly such that the first half was used to create the tree and the second half used to test it. The error rate for individual letters was 7%, which is not bad considering that test and training data were different, but this performance is still not nearly good enough to compete with conventional rule systems. Consonantal letters were found to be quite regular and amenable to translation with low error rates by this approach. However, the five vowel letters and "Y" accounted for four-fifths of the errors.

In summary, given the attention that NETtalk and other neuron-like devices have received recently, it is disturbing that NETtalk does not learn training-set data perfectly, appears to make generalizations suboptimally, and has an overall performance that is not acceptable for a practical system. Furthermore, it is unlikely that larger training lexicons would converge to a more acceptable performance.
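The letter-at-a-time idea can be illustrated with a minimal sketch: each letter is transcribed by looking up progressively narrower contexts (letter plus neighbours, then letter alone) until a context seen in training is found, approximating one path through such a decision tree. The aligned training pairs below are hypothetical toy data, not the dictionary used by Klatt and Shipman (1982).

```python
from collections import Counter, defaultdict

def train(aligned_words):
    """aligned_words: list of (spelling, per-letter phoneme list) pairs."""
    counts = defaultdict(Counter)
    for word, phones in aligned_words:
        padded = "#" + word + "#"   # word-boundary marker
        for i, phone in enumerate(phones, start=1):
            # Store each letter under three context widths, widest first.
            for ctx in (padded[i-1:i+2], padded[i] + padded[i+1], padded[i]):
                counts[ctx][phone] += 1
    return counts

def transcribe(word, counts):
    padded = "#" + word + "#"
    out = []
    for i in range(1, len(word) + 1):
        # Back off from the widest context to the bare letter.
        for ctx in (padded[i-1:i+2], padded[i] + padded[i+1], padded[i]):
            if ctx in counts:
                out.append(counts[ctx].most_common(1)[0][0])
                break
        else:
            out.append("?")   # letter never seen in training
    return out

# Toy alignment: one phoneme per letter, "_" marking a silent letter.
training = [("cat", ["k", "ae", "t"]), ("cake", ["k", "ey", "k", "_"])]
model = train(training)
print(transcribe("kate", model))  # prints ['k', 'ae', 't', '_']
```

Fittingly, even this tiny example gets the consonants and the silent "e" of the unseen word "kate" right but picks the wrong vowel ("ae" by analogy with "cat" rather than "ey"), echoing the finding that vowel letters account for most of the errors.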
Aside from limitations imposed by the network model, problems inherent in all these approaches are (1) the considerable extent of letter context that can influence stress patterns in a long word (and hence affect vowel quality in words like "photograph/photography"), (2) the confusion caused by some letter pairs, like CH, which function as a single letter in a deep sense, and thus misalign any relevant letters occurring further from the vowel, and (3) the difficulty of dealing with compound words (such as "houseboat" with its silent "e"), i.e., compounds act as if a space were hidden between two of the letters inside the word. The necessity of morphemic analysis is supported by data indicating that good spellers look for morphemes inside letter strings (Fischer et al., 1985), whereas to date these learning models seek regularities in letter patterns without recourse to a lexicon of any sort. On the other hand, efforts to find clear psychological evidence for morphological analysis of complex related forms (as opposed to rote learning of each) for word pairs such as "heal/health," "original/originality," "magic/magician," and "sign/signal" have generally failed (Carlisle, 1985).

1. Prediction of lexical stress from orthography
The Hunnicutt (1976) rule system included the improved version of
Chomsky-Halle stress rules (Halle and Keyser, 1971) consisting of
eight general rules, the most well-known of which are the main and
alternating stress rules for predicting which syllable receives
primary stress as a function of the "strong/weak" syllable pattern of
the word. Also included were rules for decomposing words by stripping
off affixes to recover the root. About 15 different prefixes and 50
suffixes were detected. Grammatical constraints were invoked to
prevent incompatible suffix sequences from being removed. Orthographic
features permitted rules to refer to concepts such as "true consonant"
and "vowel-like letter." Although it was the best-performing algorithm
of its time, the system was completely correct for only about 65% of a
random selection of words (Hunnicutt, 1980). 11
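The affix-stripping step of such a decomposition can be sketched as a simple recursive search: strip any known prefix or suffix and recurse until a listed root is exposed. The prefix, suffix, and root lists below are tiny illustrative stand-ins for the roughly 15 prefixes and 50 suffixes of the Hunnicutt system, and the grammatical constraints that block incompatible suffix sequences are omitted.

```python
# Illustrative affix and root inventories (hypothetical, not Hunnicutt's).
PREFIXES = ["un", "re", "pre"]
SUFFIXES = ["ness", "ly", "ing", "ful"]
ROOTS = {"happy", "happi", "hope", "do"}

def decompose(word):
    """Return (prefixes, root, suffixes), or None if no root is found."""
    if word in ROOTS:
        return ([], word, [])
    for p in PREFIXES:
        if word.startswith(p):
            parts = decompose(word[len(p):])
            if parts:
                return ([p] + parts[0], parts[1], parts[2])
    for s in SUFFIXES:
        if word.endswith(s):
            parts = decompose(word[:-len(s)])
            if parts:
                return (parts[0], parts[1], parts[2] + [s])
    return None

print(decompose("unhappiness"))  # prints (['un'], 'happi', ['ness'])
print(decompose("redoing"))     # prints (['re'], 'do', ['ing'])
```

With the root recovered, stress rules of the Chomsky-Halle type can then apply to the root plus stress-affecting suffixes rather than to the raw letter string.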
A good fraction of the errors
made by this letter-to-phoneme system were stress errors. In fact,
Bernstein and Nessly (1981) showed that a much simpler set of stress
rules described by Hill and Nessly (1973) performed about as well as
the Chomsky-Halle implementation. More recent high-performance
letter-to-phoneme rule systems (Bernstein and Nessly, 1981; Hunnicutt,
1980; Hertz, 1982; Carlson et al., 1982a; Church, 1985; Conroy and
Vitale, 1986) include improved attempts at morphemic decomposition
and stress prediction. Stress assignment is perhaps the weakest link
in all systems because an incorrect stress pattern, while perceptually
disruptive in and of itself, usually also triggers mis-selection of
vowel qualities. The newer systems not only base stress assignment on
factors such as morphological structure and the distinction between
strong and weak syllables (Chomsky and Halle, 1968), but also on
presumed part of speech, and in some cases, etymology (for a good
review, see Church, 1985). The importance of syntactic categorization
is