then single letters were converted to phonemic form. For each letter, rules were ordered so that the first rules treated special cases with complex environmental specifications, and the last rule was always a default phonemic correspondence. For example, a rule might say that the letter A => /e/ if followed by VE. The rule correctly treats words like "behave," but not "have." A slightly more complicated variant of the same idea was to convert consonants first (Hunnicutt, 1976). This permitted the phonemic representations of consonants to be used in the context specifications for the more difficult conversion of vowels. Systems of this kind may have more than 500 such rules for the interpretation of letter strings.
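
As a concrete illustration of such rule ordering, here is a minimal Python sketch; the rule set and phoneme notation are invented for illustration and are not drawn from any of the systems discussed.

    # Ordered rules for the letter A: special cases first, unconditional default last.
    # Each rule pairs a right-hand letter context with a phonemic output.
    RULES_FOR_A = [
        ("ve", "/e/"),    # matches "behave" correctly, but also (wrongly) "have"
        ("",   "/ae/"),   # default correspondence when no special case applies
    ]

    def convert_a(word, i):
        """Return a phoneme for the letter 'a' at position i of word."""
        right = word[i + 1:]
        for context, phoneme in RULES_FOR_A:
            if right.startswith(context):   # the first matching rule wins
                return phoneme

    print(convert_a("behave", 3))   # /e/   (correct)
    print(convert_a("have", 1))     # /e/   (overgeneralized; should be /ae/)

The ordering matters: placing the default above the special case would silence the special case entirely, which is why the default correspondence always comes last.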

Several major problems were immediately apparent: (1) vowel conversion depended in part on stress pattern, (2) correct analysis often required detection of morpheme boundaries, and (3) letter contexts had structural properties, such as VC vs VCC, that one would rather refer to directly than enumerate as all possible letter sequences. Before discussing how these systems and the next generation of spelling conversion programs dealt with such issues, we consider a novel approach to the problem that has received considerable attention in the artificial intelligence community.

The conversion of letters to phonemes might appear to be a pattern-matching problem amenable to statistical learning strategies. For example, Sejnowski and Rosenberg (1986) considered the problem of creating a network, which they called NETtalk, that takes a seven-letter window as input and outputs the phoneme corresponding to the middle letter. A set of 120 "hidden" neuron-like threshold elements mediated between input neurons corresponding to 29 possible letters at each of seven positions and an output set of neurons representing about 40 phonemes and two degrees of stress. The weights on the input and output connections of the hidden units were initially random, but were adjusted through incremental training on a 20,000-word phonemic dictionary. When evaluated on the words of this training set, the network was correct for about 90% of the phonemes and stress patterns. In some sense this is a surprisingly good result, in that so much knowledge could be embedded in a moderate number of weights (about 25,000), but the performance is not nearly as accurate as that of a good set of letter-to-sound rules (performing without use of an exceptions dictionary, but with rules for recognizing common affixes). A typical knowledge-based rule system (Bernstein and Pisoni, 1980) is claimed to perform at about
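
The architecture just described can be sketched in a few lines of Python. The layer sizes follow the figures quoted in the text; the one-of-29 input coding, the symbol set, the initialization, and the omission of training are assumptions made here purely for illustration.

    import numpy as np

    # Sketch of a NETtalk-style network. Layer sizes follow the text; the
    # symbol set and random initialization are illustrative assumptions.
    WINDOW, ALPHABET, HIDDEN, OUTPUTS = 7, 29, 120, 42   # ~40 phonemes + 2 stress units
    SYMBOLS = "abcdefghijklmnopqrstuvwxyz _."            # 29 illustrative input symbols

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(WINDOW * ALPHABET, HIDDEN))   # input -> hidden
    W2 = rng.normal(scale=0.1, size=(HIDDEN, OUTPUTS))             # hidden -> output
    # 203*120 + 120*42, roughly 29,000 weights: the same order as the ~25,000 cited.

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def encode(window):
        """One-hot encode a seven-character window: one unit per symbol per position."""
        x = np.zeros(WINDOW * ALPHABET)
        for pos, ch in enumerate(window):
            x[pos * ALPHABET + SYMBOLS.index(ch)] = 1.0
        return x

    def forward(x):
        """Output activations; the strongest unit names the predicted phoneme/stress."""
        return sigmoid(sigmoid(x @ W1) @ W2)

    # The window slides along the word; the network labels the middle (fourth) letter.
    y = forward(encode("behave "))     # window centered on the 'a' of "behave"
    print(int(np.argmax(y)))           # index of the winning output unit (untrained here)

Training, omitted here, would incrementally adjust W1 and W2 toward the dictionary transcriptions; untrained, the output above is of course arbitrary.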