KLATT 1987, p. 767
evoked or the detailed acoustic effects of the transformation. For example, only the fronting/backing anticipation of the interacting vowels in a VCV sequence described by Öhman has been implemented as a change to the F2 trajectory. The list of discrete allophones inserted and manipulated by internal Klattalk rules, shown in Table III, is rather small. All other allophonic variation is created by modifying synthesizer control parameter data directly rather than by defining a discrete symbol.
In summary, it is likely that this area of allophonic detail and
prosodic specification is one of the weaker aspects of rule systems,
and contributes significantly to the perception of unnaturalness
attributed to synthetic speech. Incremental improvements that are
made to these rules on the basis of comparisons between rule output
and natural speech cannot help but lead to improved performance of
text-to-speech systems.

II. TEXT-TO-PHONEMES CONVERSION

Having considered the steps required to go from an abstract linguistic description to synthetic speech, we now turn to the problem of deriving this description from text. The recognition of printed characters, as required in, e.g., a reading machine for the blind, is beyond the scope of this review. We will assume that an ASCII representation of each input sentence is available as input to the text analysis module of a text-to-speech system. From considerations outlined in the previous section, it is clear that the text analysis routines have a formidable task. Ideally, the input is to be analyzed in such a way as to:
For example, the input ASCII string for a typical input sentence,
shown below, was processed by rules of Klattalk to derive an abstract
linguistic representation consisting of phonemes, stress, and syntactic
symbols. First, the word-formatting module transformed the numerals
"23" into the words "twenty-three."
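The numeral expansion performed by a word-formatting step can be sketched as follows. This is a minimal illustration, not Klattalk's actual implementation: the function names `number_to_words` and `expand_numerals` are hypothetical, and the sketch handles only standalone integers from 0 to 99.

```python
import re

# Word tables for 0-19 and for the tens (indices 0 and 1 are unused).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99 (assumed limit of this sketch)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def expand_numerals(text: str) -> str:
    """Replace each standalone one- or two-digit numeral with its spelled form."""
    return re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_words(int(m.group())),
                  text)


print(expand_numerals("She turned 23 last week."))
```

A real word-formatting module would also have to deal with larger numbers, decimals, dates, abbreviations, and similar non-word input, none of which is attempted here.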
There is no semantic analysis in Klattalk or any other present-day text-to-speech system. Every sentence is spoken in a sort of semantically "neutral" way, i.e., without emphatic or contrastive stress, unless the user indicates an important word by placing the phonemic [ " ] symbol before it in the orthography.
Next, a phonemic representation is obtained for the words in the
manner shown in Fig. 30.
Each word is compared with entries in a
small pronunciation dictionary. If no match is found, the word is
broken into smaller pieces (morphemes) by attempting to remove
common suffixes such as "-ed," "-ing," etc. It may be necessary to
add a silent "e" or to reconstitute the "y" in order to recover the
true form of the root, as in "biting = bite + ing." Then the remaining
root is again compared with entries in the phonemic dictionary. If
there is still no match, a set of letter-to-phoneme rules is invoked to predict the
pronunciation. In this sentence, two words had affixes removed, five
words/roots were found in the dictionary, and the remaining one was
processed by letter-to-sound rules. No errors were made. The morpheme
"protest" was found to have two alternative pronunciations in the
dictionary, one with primary stress on the first syllable and the
other with primary stress on the second syllable, but a selectional
restriction associated with the "-er" suffix caused correct selection
of the noun form.
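The lookup order described above (whole-word dictionary match, then suffix stripping with root repair, then letter-to-phoneme rules) can be sketched as below. Everything specific in the sketch is an assumption made for illustration: the dictionary entries, the transcription symbols, the two-suffix list, and the function names are hypothetical, and `letter_to_sound` is only a stub standing in for a real rule set. Selection among alternative pronunciations, as with "protest," is not modeled.

```python
# Toy pronunciation dictionary. Entries and transcription symbols are
# illustrative placeholders, not Klattalk's dictionary or phoneme set.
DICTIONARY = {
    "bite": "B AY T",
    "carry": "K AE R IY",
    "walk": "W AO K",
}

# Suffixes to try stripping (a small hypothetical subset).
SUFFIXES = ("ing", "ed")


def candidate_roots(stem):
    """Yield possible spellings of the root once a suffix has been removed."""
    yield stem                   # walking -> walk
    yield stem + "e"             # biting  -> bit   -> bite  (restore silent "e")
    if stem.endswith("i"):
        yield stem[:-1] + "y"    # carried -> carri -> carry (restore "y")


def letter_to_sound(word):
    """Stub for letter-to-phoneme rules: emits one symbol per letter."""
    return " ".join(word.upper())


def pronounce(word):
    """Return (analysis, source), trying the dictionary before the rules."""
    word = word.lower()
    # Step 1: whole-word dictionary match.
    if word in DICTIONARY:
        return DICTIONARY[word], "dictionary"
    # Step 2: strip a common suffix and look up the repaired root.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            stem = word[: -len(suffix)]
            for root in candidate_roots(stem):
                if root in DICTIONARY:
                    return f"{DICTIONARY[root]} + -{suffix}", "root+suffix"
    # Step 3: fall back to the (stub) letter-to-phoneme rules.
    return letter_to_sound(word), "rules"


for w in ("walk", "biting", "carried", "xyzzy"):
    print(w, pronounce(w))
```

The sketch keeps the suffix attached as a label rather than transcribing it, since the pronunciation of a suffix such as "-ed" depends on the root it attaches to. Root repairs beyond the silent "e" and "y" cases named in the text, such as consonant doubling, are left out.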